TY - GEN
T1 - Image-Text Retrieval via Contrastive Learning with Auxiliary Generative Features and Support-set Regularization
AU - Zhang, Lei
AU - Yang, Min
AU - Li, Chengming
AU - Xu, Ruifeng
N1 - Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/7/7
Y1 - 2022/7/7
N2 - In this paper, we bridge the heterogeneity gap between different modalities and improve image-text retrieval by taking advantage of auxiliary image-to-text and text-to-image generative features with contrastive learning. Concretely, contrastive learning is devised to narrow the distance between the aligned image-text pairs and push apart the distance between the unaligned pairs from both inter- and intra-modality perspectives with the help of cross-modal retrieval features and auxiliary generative features. In addition, we devise a support-set regularization term to further improve contrastive learning by constraining the distance between each image/text and its corresponding cross-modal support-set information contained in the same semantic category. To evaluate the effectiveness of the proposed method, we conduct experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, MS COCO). Experimental results show that our model significantly outperforms the strong baselines for cross-modal image-text retrieval. For reproducibility, we submit the code and data publicly at: \urlhttps: //github.com/Hambaobao/CRCGS.
AB - In this paper, we bridge the heterogeneity gap between different modalities and improve image-text retrieval by taking advantage of auxiliary image-to-text and text-to-image generative features with contrastive learning. Concretely, contrastive learning is devised to narrow the distance between the aligned image-text pairs and push apart the distance between the unaligned pairs from both inter- and intra-modality perspectives with the help of cross-modal retrieval features and auxiliary generative features. In addition, we devise a support-set regularization term to further improve contrastive learning by constraining the distance between each image/text and its corresponding cross-modal support-set information contained in the same semantic category. To evaluate the effectiveness of the proposed method, we conduct experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, MS COCO). Experimental results show that our model significantly outperforms the strong baselines for cross-modal image-text retrieval. For reproducibility, we submit the code and data publicly at: \urlhttps: //github.com/Hambaobao/CRCGS.
KW - contrastive learning
KW - cross-modal image-text retrieval
KW - generative features
KW - support-set regularization
UR - https://www.scopus.com/pages/publications/85135015056
U2 - 10.1145/3477495.3531783
DO - 10.1145/3477495.3531783
M3 - 会议稿件
AN - SCOPUS:85135015056
T3 - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1938
EP - 1943
BT - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022
Y2 - 11 July 2022 through 15 July 2022
ER -