SegCLIP: Multimodal Visual-Language and Prompt Learning for High-Resolution Remote Sensing Semantic Segmentation

  • Shijie Zhang
  • Bin Zhang*
  • Yuntao Wu
  • Huabing Zhou
  • Junjun Jiang
  • Jiayi Ma
  • *Corresponding author for this work
  • Wuhan Institute of Technology
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Wuhan University

Research output: Contribution to journal › Article › peer-review

Abstract

Remote sensing semantic segmentation is considered a key step in the intelligent interpretation of high-resolution remote sensing (HRRS) images, with widespread applications in fields such as hazard assessment, environmental monitoring, and urban planning. Recently, numerous deep learning-based semantic segmentation methods have emerged, achieving significant breakthroughs. However, the majority of current research still concentrates on representation learning in the visual feature space, with the potential of multimodal data sources yet to be fully explored. In recent years, the foundational visual language model, namely contrastive language-image pretraining (CLIP), has established a new paradigm in the visual field, demonstrating excellent generalization capabilities and deep semantic understanding across a variety of tasks. Inspired by prompt learning, we propose a prompting approach based on linguistic descriptions to enable CLIP to generate semantically distinct contextual information for remote sensing images. We introduce the SegCLIP network architecture, a novel framework specifically designed for semantic segmentation of HRRS images. Specifically, we have adapted CLIP to extract text information, thereby guiding the visual model in distinguishing among classes. Additionally, we have designed a cross-modal feature fusion (CFF) module that integrates linguistic and visual semantic features, ensuring semantic consistency across modalities. Finally, we have fully exploited the potential of text data and have used additional real text to refine ambiguous query features. Experimental evaluations confirm that the method exhibits superior performance on the LoveDA, iSAID, and UAVid public semantic segmentation datasets.

Original language: English
Article number: 5646316
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 62
State: Published - 2024
Externally published: Yes

Keywords

  • Attention mechanism
  • contrastive language-image pretraining (CLIP)
  • prompt learning
  • remote sensing
  • semantic segmentation
