Abstract
Fine-grained cross-modal retrieval is a prominent research focus in information retrieval and multi-modal learning. Existing methods lack local fragment matching when computing image–text semantic similarity, so they perform poorly at measuring local similarity between image regions and words. They also fail to explore relationships between local fragments over larger ranges, and therefore struggle to distinguish complex scenarios in which identical local fragments stand in different relationships. To address this, we propose Combination of Phrase Matchings (CPM) based cross-modal retrieval. Specifically, we introduce stacked GCNs whose different layers model non-neighboring phrases within various ranges, so that image–text semantic similarity can be computed at the level of complex scenarios. We then combine local fragment matching, neighboring phrase matching, and multiple non-neighboring phrase matchings (referred to collectively as phrase matching) to reflect image–text semantic similarity more comprehensively. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate the superior performance of CPM. The experiments also highlight the roles that local fragment matching, neighboring phrase matching, and non-neighboring phrase matching at different scales play in enhancing image–text matching.
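The idea that stacked GCN layers reach phrases at increasing ranges can be illustrated with a small sketch. This is not the CPM implementation: the chain-shaped fragment graph, the unweighted averaging layer, and every function name below are assumptions chosen for illustration. It only shows the general property that k stacked graph-convolution layers let a fragment aggregate information from fragments up to k hops away.

```python
# Illustrative sketch only. The chain graph, the averaging rule, and all names
# are assumptions for this example, not the method described in the paper.

def chain_adjacency(n):
    """Adjacency with self-loops: fragment i is linked to i-1, i, and i+1."""
    return [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]

def row_normalize(a):
    """Divide each row by its sum so a layer averages over neighbors."""
    return [[v / sum(row) for v in row] for row in a]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(a_norm, h):
    """Simplified graph convolution: neighbor averaging, with no learned
    weights and no nonlinearity (both omitted to keep the sketch short)."""
    return matmul(a_norm, h)

def receptive_field(n, layers, node):
    """Indices of fragments that influence `node` after `layers` stacked layers."""
    a_norm = row_normalize(chain_adjacency(n))
    # One-hot features make each fragment's contribution traceable.
    h = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(layers):
        h = gcn_layer(a_norm, h)
    return [j for j, v in enumerate(h[node]) if v > 0.0]

if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, receptive_field(7, depth, node=3))
```

In this toy setting, one layer covers only immediate neighbors, while deeper stacks cover progressively more distant fragments, which is the sense in which different depths correspond to phrase matching at different ranges.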
| Original language | English |
|---|---|
| Article number | 130174 |
| Journal | Neurocomputing |
| Volume | 638 |
| State | Published - 14 Jul 2025 |
| Externally published | Yes |
Keywords
- Cross-modal retrieval
- Graph convolutional network
- Image–text matching
- Phrase matching