Improving Cross-Modal Image-Text Retrieval with Teacher-Student Learning

  • Junhao Liu
  • Min Yang*
  • Chengming Li
  • Ruifeng Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Cross-modal image-text retrieval has emerged as a challenging task that requires the multimedia system to bridge the heterogeneity gap between different modalities. In this paper, we take full advantage of image-to-text and text-to-image generation models to improve the performance of the cross-modal image-text retrieval model by incorporating the text-grounded and image-grounded generative features into the cross-modal common space with a 'Two-Teacher One-Student' learning framework. In addition, a dual regularizer network is designed to distinguish the mismatched image-text pairs from the matched ones. In this way, we can capture the fine-grained correspondence between modalities and distinguish the best-retrieved result from a candidate set. Extensive experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, and MS COCO) show that our model can achieve state-of-the-art cross-modal retrieval results. In particular, our model improves the image-to-text and text-to-image retrieval accuracy by more than 22% over the best competitors on the MS COCO dataset.
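The abstract describes a 'Two-Teacher One-Student' framework in which generative features from an image-to-text teacher and a text-to-image teacher are injected into the student's cross-modal common space. The following is a minimal numeric sketch of how such a combined objective might look; the loss forms, weights, and shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two feature arrays."""
    return float(np.mean((a - b) ** 2))

def two_teacher_student_loss(student_emb, teacher_i2t, teacher_t2i,
                             retrieval_loss, alpha=0.5, beta=0.5):
    """Hypothetical combined objective: a base retrieval loss plus two
    distillation terms pulling the student's common-space embeddings
    toward each teacher's generative features (assumed MSE here)."""
    distill_i2t = mse(student_emb, teacher_i2t)  # image-to-text teacher
    distill_t2i = mse(student_emb, teacher_t2i)  # text-to-image teacher
    return retrieval_loss + alpha * distill_i2t + beta * distill_t2i

rng = np.random.default_rng(0)
s  = rng.normal(size=(4, 8))   # student common-space embeddings
t1 = rng.normal(size=(4, 8))   # image-to-text teacher features
t2 = rng.normal(size=(4, 8))   # text-to-image teacher features
loss = two_teacher_student_loss(s, t1, t2, retrieval_loss=1.0)
```

Because the distillation terms are non-negative, the combined loss never falls below the base retrieval loss, and it reduces to the retrieval loss alone when the student already matches both teachers.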

Original language: English
Article number: 9257382
Pages (from-to): 3242-3253
Number of pages: 12
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 31
Issue number: 8
State: Published - Aug 2021
Externally published: Yes

Keywords

  • Cross-modal image-text retrieval
  • Image-to-text generation
  • Teacher-student learning
  • Text-to-image generation
