Abstract
Cross-modal image-text retrieval is a challenging task that requires a multimedia system to bridge the heterogeneity gap between modalities. In this paper, we take full advantage of image-to-text and text-to-image generation models to improve cross-modal image-text retrieval: the text-grounded and image-grounded generative features are incorporated into the cross-modal common space through a 'Two-Teacher One-Student' learning framework. In addition, a dual regularizer network is designed to distinguish mismatched image-text pairs from matched ones. In this way, the model captures fine-grained correspondences between modalities and singles out the best-retrieved result from a candidate set. Extensive experiments on three benchmark datasets (MIRFLICKR-25K, NUS-WIDE, and MS COCO) show that our model achieves state-of-the-art cross-modal retrieval results. In particular, it improves image-to-text and text-to-image retrieval accuracy by more than 22% over the best competitors on the MS COCO dataset.
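The abstract does not give the paper's equations, but the overall objective it describes — a retrieval loss in a common embedding space plus distillation from two generative teachers — can be sketched roughly as follows. This is a minimal NumPy illustration under assumed formulations (a VSE-style hinge loss for retrieval and an MSE term for each teacher); the function names, weights `alpha`/`beta`, and loss forms are hypothetical, not the authors' actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieval_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking loss over in-batch negatives (assumed VSE-style form).

    Matched image-text pairs sit on the diagonal of the similarity matrix;
    every off-diagonal entry is treated as a negative for both directions.
    """
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    sim = img @ txt.T                           # (B, B) cosine similarities
    pos = np.diag(sim)                          # similarity of matched pairs
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])  # image -> text
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])  # text -> image
    mask = 1.0 - np.eye(len(img))               # exclude the positive pair itself
    return ((cost_i2t + cost_t2i) * mask).sum() / len(img)

def distillation_loss(student_feat, teacher_feat):
    """MSE between normalized student features and a teacher's generative
    features (hypothetical formulation of the 'teacher' supervision)."""
    return np.mean((l2_normalize(student_feat) - l2_normalize(teacher_feat)) ** 2)

def two_teacher_one_student_loss(img_s, txt_s, i2t_teacher_feat, t2i_teacher_feat,
                                 alpha=0.5, beta=0.5):
    """Total objective: common-space retrieval loss plus weighted distillation
    from the image-to-text and text-to-image generation teachers."""
    return (retrieval_loss(img_s, txt_s)
            + alpha * distillation_loss(txt_s, i2t_teacher_feat)
            + beta * distillation_loss(img_s, t2i_teacher_feat))
```

In this reading, the two generation models act purely as fixed teachers: their features shape the student's common space during training, and only the student is needed at retrieval time.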
| Original language | English |
|---|---|
| Article number | 9257382 |
| Pages (from-to) | 3242-3253 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 31 |
| Issue number | 8 |
| DOIs | |
| State | Published - Aug 2021 |
| Externally published | Yes |
Keywords
- Cross-modal image-text retrieval
- Image-to-text generation
- Teacher-student learning
- Text-to-image generation
Title: Improving Cross-Modal Image-Text Retrieval with Teacher-Student Learning