Abstract
Synthetic Aperture Radar (SAR), an active microwave remote sensing system, offers all-weather, day-and-night observation capabilities and has considerable application value in disaster monitoring, urban management, and military reconnaissance. Although deep learning techniques have achieved remarkable progress in interpreting SAR images, existing methods for target recognition and detection primarily focus on local feature extraction and single-target discrimination. They struggle to comprehensively characterize the global semantic structure and multi-target relationships in complex scenes, and the interpretation process remains highly dependent on human expertise, with limited automation. SAR image captioning aims to translate visual information into natural language, serving as a key technology to bridge the gap between “perceiving targets” and “cognizing scenes,” which is of great importance for enhancing the automation and intelligence of SAR image interpretation. However, the inherent speckle noise, the scarcity of textural details, and the substantial semantic gap in SAR images exacerbate the difficulty of cross-modal understanding. To address these challenges, this paper proposes a spatial-frequency-aware model for SAR image captioning. First, a spatial-frequency-aware module is constructed. It employs a Discrete Cosine Transform (DCT) mask attention mechanism to reweight spectral components for noise suppression and structure enhancement, combined with a Gabor multiscale texture enhancement submodule to improve sensitivity to directional and edge details. Second, a cross-modal semantic enhancement loss function is designed to bridge the semantic gap between visual features and natural language through bidirectional image-text alignment and mutual information maximization. Furthermore, a large-scale fine-grained SAR image captioning dataset, FSAR-Cap, containing 72,400 high-quality image-text pairs, is constructed.
The experimental results demonstrate that the proposed method achieves CIDEr scores of 151.00 and 95.14 on the SARLANG and FSAR-Cap datasets, respectively. Qualitatively, the model effectively suppresses hallucinations and accurately captures fine-grained spatial-textural details, considerably outperforming mainstream methods.
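As a rough illustration of what "reweighting spectral components" with a DCT mask can mean, the sketch below transforms a 2D feature map with an orthonormal DCT-II, multiplies the coefficients by a mask, and transforms back. It is not the paper's implementation: the function names `dct_matrix` and `dct_mask_reweight` are hypothetical, and the fixed low-pass mask stands in for the attention-derived mask, whose details the abstract does not give.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis C of shape (n, n); C @ x is the DCT of x."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalisation
    return c


def dct_mask_reweight(feat: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Reweight the 2D DCT coefficients of `feat` (H, W) by `mask` (H, W)."""
    h, w = feat.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spec = ch @ feat @ cw.T   # forward 2D DCT
    spec = spec * mask        # per-frequency reweighting
    return ch.T @ spec @ cw   # inverse 2D DCT (C is orthonormal)


# Toy usage: a smooth pattern corrupted by multiplicative, speckle-like noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:32, 0:32]
clean = 1.0 + 0.5 * np.cos(np.pi * (2 * xx + 1) * 2 / 64)
noisy = clean * rng.gamma(shape=4.0, scale=0.25, size=clean.shape)

# Assumption: a hand-made mask keeping only low frequencies.
# A learned mask would instead assign a weight to every coefficient.
mask = ((yy + xx) < 12).astype(float)
filtered = dct_mask_reweight(noisy, mask)
```

An all-ones mask returns the input unchanged, and a mask that keeps only the DC coefficient returns the mean of the feature map, which makes the operation easy to sanity-check before swapping in learned weights.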
| Translated title of the contribution | DGS-CapNet: A Spatial-frequency-aware Model for SAR Image Captioning |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 441-462 |
| Number of pages | 22 |
| Journal | Journal of Radars |
| Volume | 15 |
| Issue number | 2 |
| DOIs | |
| State | Published - Apr 2026 |
| Externally published | Yes |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 11 Sustainable Cities and Communities