Abstract
As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. Training such a model requires large numbers of correctly aligned data pairs. However, unlike unimodal datasets, multimodal datasets are far harder to collect and annotate precisely. As an alternative, co-occurring data pairs (e.g., image-text pairs) collected from the Internet have been widely exploited in the area. Unfortunately, such cheaply collected datasets unavoidably contain many mismatched data pairs, which have been shown to harm the model's performance. To address this, we propose BiCro++ (Improved Bidirectional Cross-modal Similarity Consistency). This module can be integrated into existing cross-modal matching models, enhancing their robustness against noisy data through self-adaptive soft labels that dynamically reflect the true correspondence of data pairs. The basic idea of BiCro++, taking image-text matching as an example, is that similar images should have similar textual descriptions and vice versa. This bidirectional similarity consistency can be directly translated into soft labels that serve as a self-supervision signal to train the matching model. To further refine soft label quality, BiCro++ first introduces a Diagonal-Dominance Purification process to identify reliable anchor points from the noisy dataset as the reference for soft label estimation. It then employs a Hybrid-level Codebook Alignment mechanism that establishes enhanced consistency in bidirectional cross-modal similarity. Experiments on three popular cross-modal matching datasets show that our method significantly improves the noise-robustness of various matching models, and surpasses the state-of-the-art method by an average of 5.3%, 3.1% and 6.4% in terms of recall, respectively.
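The bidirectional consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' formulation: the ratio-and-clip rule, the function names (`cosine`, `consistency`, `soft_label`), and the toy anchor embeddings are all assumptions made for illustration, and neither Diagonal-Dominance Purification nor Hybrid-level Codebook Alignment is modeled. The sketch only assumes that a set of trusted anchor pairs is already available and that a pair's soft label should be high when its image and text sit at comparable similarity to the same anchor.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def consistency(sim_query, sim_other, eps=1e-8):
    """Ratio of other-modality to query-modality similarity, clipped to [0, 1].

    Assumed rule (not taken from the paper): if a sample is close to an anchor
    in one modality, its partner should be about as close to that anchor's
    partner in the other modality.
    """
    if sim_query <= eps:
        return 0.0
    return float(np.clip(sim_other / sim_query, 0.0, 1.0))


def soft_label(img, txt, anchor_imgs, anchor_txts):
    """Illustrative soft correspondence label in [0, 1] for one (img, txt) pair."""
    # Image -> text: find the nearest anchor image, then check whether the
    # text is similarly close to that anchor's text.
    img_sims = [cosine(img, a) for a in anchor_imgs]
    i = int(np.argmax(img_sims))
    i2t = consistency(img_sims[i], cosine(txt, anchor_txts[i]))

    # Text -> image: the same check in the opposite direction.
    txt_sims = [cosine(txt, a) for a in anchor_txts]
    j = int(np.argmax(txt_sims))
    t2i = consistency(txt_sims[j], cosine(img, anchor_imgs[j]))

    # Average the two directions to obtain the bidirectional label.
    return 0.5 * (i2t + t2i)


# Toy anchors (hypothetical embeddings): two trusted image-text pairs.
anchor_imgs = np.eye(3)[:2]
anchor_txts = np.eye(3)[:2]

matched = soft_label(anchor_imgs[0], anchor_txts[0], anchor_imgs, anchor_txts)
mismatched = soft_label(anchor_imgs[0], anchor_txts[1], anchor_imgs, anchor_txts)
print(matched, mismatched)
```

In a training loop, a label like this would typically scale the margin or weight of the matching loss for each pair, so that likely mismatches contribute less; how BiCro++ does this is not stated in the abstract.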
| Original language | English |
|---|---|
| Pages (from-to) | 4657-4672 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Pattern Analysis and Machine Intelligence |
| Volume | 48 |
| Issue number | 4 |
| State | Published - 2026 |
| Externally published | Yes |
Keywords
- Cross-modal matching
- noise-robust learning
- noisy correspondence
Fingerprint
Dive into the research topics of 'Noisy Correspondence Rectification in Multimodal Clustering Space for Cross-Modal Matching'. Together they form a unique fingerprint.