Combination of Phrase Matchings based cross-modal retrieval

  • Li Zhang*
  • Yahu Yang
  • Shuheng Ge
  • Guanghui Sun
  • *Corresponding author for this work
  • Faculty of Computing, Harbin Institute of Technology
  • School of Astronautics, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Fine-grained cross-modal retrieval is a prominent research focus in information retrieval and multi-modal learning. Existing methods lack local fragment matching when computing image–text semantic similarity, so they perform poorly at measuring local similarity between image regions and words. They also fail to explore relationships between local fragments at a larger scale. As a result, models struggle to distinguish complex scenarios in which identical local fragments stand in different relationships. We therefore propose Combination of Phrase Matchings (CPM) based cross-modal retrieval. Specifically, we stack GCNs and use their different layers to model non-neighboring phrases over various ranges, which lets us compute image–text semantic similarity at the level of complex scenarios. At the same time, we combine local fragment matching, neighboring phrase matching, and multiple non-neighboring phrase matchings (collectively referred to as phrase matching) to reflect image–text semantic similarity more comprehensively. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate the superior performance of CPM. The experiments also highlight the roles that local fragment matching, neighboring phrase matching, and non-neighboring phrase matching at different scales play in enhancing image–text matching.
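
The idea of combining matchings at several ranges can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the scoring function, the graph construction, or how the scores are combined, so everything below is an assumption made for illustration. In particular, `fragment_score` (cosine similarity with a max over regions and a mean over words), `propagate` (an unweighted neighbor average standing in for a learned GCN layer), the chain-shaped adjacency, and the plain average in `combined_score` are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def fragment_score(regions, words):
    """Assumed local matching: best region per word, averaged over words."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)


def propagate(nodes, adjacency):
    """One simplified graph step: replace each node by the mean of itself
    and its neighbors. A real GCN layer would add learned weights and a
    nonlinearity; both are omitted here."""
    out = []
    for i, node in enumerate(nodes):
        group = [node] + [nodes[j] for j in adjacency[i]]
        out.append([sum(vals) / len(group) for vals in zip(*group)])
    return out


def chain(n):
    """Hypothetical adjacency: each fragment is linked to its direct neighbors."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


def combined_score(regions, words, region_adj, word_adj, layers=2):
    """Average the fragment-level score with the scores obtained after each
    stacked propagation step. Each extra step widens the context a node sees,
    which loosely mirrors matching phrases over larger ranges."""
    scores = [fragment_score(regions, words)]
    r, w = regions, words
    for _ in range(layers):
        r, w = propagate(r, region_adj), propagate(w, word_adj)
        scores.append(fragment_score(r, w))
    return sum(scores) / len(scores)
```

With toy inputs such as `regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]` and identical `words`, `combined_score(regions, words, chain(3), chain(3))` evaluates to approximately 1, since every word has an identical region at every range. Averaging the per-range scores is only one possible combination; a weighted sum would fit the same structure.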

Original language: English
Article number: 130174
Journal: Neurocomputing
Volume: 638
State: Published - 14 Jul 2025
Externally published: Yes

Keywords

  • Cross-modal retrieval
  • Graph convolutional network
  • Image–text matching
  • Phrase matching
