Noisy Correspondence Rectification in Multimodal Clustering Space for Cross-Modal Matching

  • Shuo Yang
  • Yancheng Long
  • Yujie Wei
  • Zeke Xie
  • Hongxun Yao
  • Min Xu
  • Liqiang Nie*
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Hong Kong University of Science and Technology
  • University of Technology Sydney

Research output: Contribution to journal › Article › peer-review

Abstract

As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. Training such models requires massive numbers of correctly aligned data pairs. However, multimodal datasets are far harder to collect and annotate precisely than unimodal ones. As an alternative, co-occurring data pairs (e.g., image-text pairs) collected from the Internet have been widely exploited in the area. Unfortunately, such cheaply collected datasets unavoidably contain many mismatched data pairs, which have been shown to harm model performance. To address this, we propose BiCro++ (Improved Bidirectional Cross-modal Similarity Consistency). This module can be integrated into existing cross-modal matching models, enhancing their robustness against noisy data through self-adaptive soft labels that dynamically reflect the true correspondence of data pairs. The basic idea of BiCro++ is that, taking image-text matching as an example, similar images should have similar textual descriptions and vice versa. This bidirectional similarity consistency can be directly translated into soft labels that serve as a self-supervision signal for training the matching model. To further refine soft label quality, BiCro++ first introduces a Diagonal-Dominance Purification process to identify reliable anchor points in the noisy dataset as references for soft label estimation. It then employs a Hybrid-level Codebook Alignment mechanism that establishes enhanced consistency in bidirectional cross-modal similarity. Experiments on three popular cross-modal matching datasets show that our method significantly improves the noise-robustness of various matching models, and surpasses the state-of-the-art method by an average of 5.3%, 3.1% and 6.4% in terms of recall, respectively.
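
The bidirectional similarity consistency idea in the abstract can be illustrated with a toy sketch. This is not the BiCro++ formulation, which the abstract does not specify. The functions `cosine`, `consistency`, and `soft_label`, the use of cosine similarity, the `1 - |difference|` scoring rule, and the averaging of the two directions are all assumptions made for illustration. The anchor pairs are simply given here, whereas the paper selects them with Diagonal-Dominance Purification.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency(s_src, s_tgt):
    """Hypothetical score: 1 when the two similarities agree, toward 0 as they diverge."""
    return max(0.0, 1.0 - abs(s_src - s_tgt))


def soft_label(img, txt, anchors):
    """Estimate a soft label in [0, 1] for the pair (img, txt).

    anchors: list of (anchor_img, anchor_txt) pairs assumed to be correctly matched.
    """
    # Image -> text: if img resembles an anchor image, txt should resemble its text.
    a_img, a_txt = max(anchors, key=lambda a: cosine(img, a[0]))
    i2t = consistency(cosine(img, a_img), cosine(txt, a_txt))

    # Text -> image: if txt resembles an anchor text, img should resemble its image.
    b_img, b_txt = max(anchors, key=lambda a: cosine(txt, a[1]))
    t2i = consistency(cosine(txt, b_txt), cosine(img, b_img))

    # Assumed combination rule: average the two directions.
    return 0.5 * (i2t + t2i)


# Toy embeddings: two anchor pairs lying on orthogonal axes.
anchors = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
matched = soft_label([1.0, 0.0], [1.0, 0.0], anchors)
mismatched = soft_label([1.0, 0.0], [0.0, 1.0], anchors)
```

In this toy setup a pair whose image and text both sit next to the same anchor receives a high label, while a pair whose image sits near one anchor and whose text sits near another receives a low one. Such a label could then weight the pair's contribution to a matching loss.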

Original language: English
Pages (from-to): 4657-4672
Number of pages: 16
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 48
Issue number: 4
DOIs
State: Published - 2026
Externally published: Yes

Keywords

  • Cross-modal matching
  • noise-robust learning
  • noisy correspondence
