DGS-CapNet:基于空间-频率感知的SAR图像描述模型

Translated title of the contribution: DGS-CapNet: A Spatial-frequency-aware Model for SAR Image Captioning
  • School of Electronics and Information Engineering, Harbin Institute of Technology
  • CAS - Institute of Software
  • CAS - Aerospace Information Research Institute

Research output: Contribution to journal › Article › peer-review

Abstract

Synthetic Aperture Radar (SAR), as an active microwave remote sensing system, offers all-weather, all-day observation capabilities and has considerable application value in disaster monitoring, urban management, and military reconnaissance. Although deep learning techniques have achieved remarkable progress in interpreting SAR images, existing methods for target recognition and detection focus primarily on local feature extraction and single-target discrimination. They struggle to characterize the global semantic structure and multitarget relationships in complex scenes, and the interpretation process remains highly dependent on human expertise, with limited automation. SAR image captioning aims to translate visual information into natural language, serving as a key technology to bridge the gap between “perceiving targets” and “cognizing scenes,” which is important for improving the automation and intelligence of SAR image interpretation. However, the inherent speckle noise, the scarcity of textural details, and the substantial semantic gap in SAR images make cross-modal understanding more difficult. To address these challenges, this paper proposes a spatial-frequency-aware model for SAR image captioning. First, a spatial-frequency-aware module is constructed. It employs a Discrete Cosine Transform (DCT) mask attention mechanism to reweight spectral components for noise suppression and structure enhancement, combined with a Gabor multiscale texture enhancement submodule to improve sensitivity to directional and edge details. Second, a cross-modal semantic enhancement loss function is designed to bridge the semantic gap between visual features and natural language through bidirectional image-text alignment and mutual information maximization. Furthermore, a large-scale fine-grained SAR image captioning dataset, FSAR-Cap, containing 72,400 high-quality image-text pairs, is constructed.
The experimental results demonstrate that the proposed method achieves CIDEr scores of 151.00 and 95.14 on the SARLANG and FSAR-Cap datasets, respectively. Qualitatively, the model effectively suppresses hallucinations and accurately captures fine-grained spatial-textural details, considerably outperforming mainstream methods.
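
The DCT-based reweighting step mentioned in the abstract can be illustrated with a minimal sketch. It only shows the general idea of scaling the 2-D DCT coefficients of a feature map by a mask and transforming back. The abstract does not specify how the mask is produced, so the function names (dct_matrix, dct_mask_reweight) and the fixed low-pass mask below are assumptions made for illustration, not the authors' implementation; in an attention mechanism the mask would presumably be predicted from the features rather than hand-made.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row scaling makes the matrix orthonormal
    return mat


def dct_mask_reweight(feature: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scale the 2-D DCT spectrum of one feature map by a mask (hypothetical helper)."""
    h, w = feature.shape
    d_h, d_w = dct_matrix(h), dct_matrix(w)
    spectrum = d_h @ feature @ d_w.T  # forward 2-D DCT
    reweighted = spectrum * mask      # per-frequency weighting
    return d_h.T @ reweighted @ d_w   # inverse 2-D DCT


# Usage with a hand-made low-pass mask (an assumption, not a learned mask).
rng = np.random.default_rng(0)
feature = rng.normal(size=(8, 8))
u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
low_pass = (u + v < 6).astype(float)  # keep low frequencies, zero the rest
smoothed = dct_mask_reweight(feature, low_pass)
```

Because the basis is orthonormal, a mask of all ones returns the input unchanged, and a mask with values between 0 and 1 can only reduce the energy of the feature map. Suppressing high-frequency coefficients is one plausible way such a mask could attenuate speckle-like noise while keeping coarse structure.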

Original language: Chinese (Traditional)
Pages (from-to): 441-462
Number of pages: 22
Journal: Journal of Radars
Volume: 15
Issue number: 2
State: Published - Apr 2026
Externally published: Yes

