Abstract
Recent years have witnessed the rapid growth of multimedia data, such as texts and images, prompting many researchers to work on multimodal representation, understanding, and reasoning. As a fundamental task of multimodal interaction, image-text matching, which measures the semantic similarity between an image and a text, has attracted extensive research attention. It facilitates various applications, such as cross-modal retrieval, visual question answering, and multimedia understanding, and plays a critical role in bridging vision and language. Recently, deep learning techniques have emerged as powerful methods for various tasks, which has motivated many researchers to apply deep learning approaches to the image-text matching task. In particular, great progress has been made by exploiting the global alignment between images and sentences, or local alignments between image regions and textual words. Existing methods can be roughly divided into the following categories: global representation-based image-text matching methods, local representation-based image-text matching methods, external knowledge-based image-text matching methods, metric learning-based image-text matching methods, and multimodal pre-training models.
To be specific, global representation-based image-text matching methods usually realize cross-modal matching by measuring the semantic similarity between the global image and text representations; local representation-based image-text matching methods focus on modeling fine-grained correlations between visual and textual entities; external knowledge-based image-text matching methods are devoted to acquiring prior knowledge from external sources, such as scene graphs, to improve the accuracy of image-text matching; metric learning-based image-text matching methods explore better constraints or similarity measurements to improve the discriminability between unpaired samples and the relevance between paired samples; and multimodal pre-training models, including single-stream and two-stream frameworks, offer strong generalization ability. To give a comprehensive overview of this field, including models, datasets, and future directions, we summarize the work on image-text matching and present this survey. Specifically, to perform a deeper analysis of existing approaches, we establish a fine-grained taxonomy for each category. For instance, we further divide global representation-based image-text matching methods into two categories according to their architectures: embedding-based methods and interaction-based methods. Among these, embedding-based methods directly constrain the representation learning of images and text in the common space, while interaction-based methods exploit cross-modal interactive information for better semantic matching. As for local representation-based image-text matching methods, we further divide them into three categories according to their interaction patterns: intra-modal modeling, inter-modal modeling, and hybrid interaction modeling-based approaches.
More concretely, intra-modal modeling-based image-text matching methods independently explore relationships between entities within a particular modality, and inter-modal modeling-based image-text matching methods explore cross-modal relationships to better align visual and textual semantic information. In contrast, hybrid interaction modeling-based approaches model both cross-modal interactions and intra-modal correlations, so as to enhance intra-modal and inter-modal relationship modeling simultaneously. Subsequently, we summarize several benchmark image-text matching datasets and analyze the experimental results of existing models. In addition, we introduce some related research tasks, including weakly-supervised cross-modal matching, zero-shot cross-modal matching, cross-lingual image retrieval, and scene-text aware cross-modal retrieval. Finally, we discuss promising future directions for this task, in particular standard dataset partitioning, interpretable image-text matching models, and efficient image-text matching models.
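As a rough illustration of the distinctions drawn above, the following minimal Python sketch contrasts a global representation-based score (one similarity between whole-image and whole-sentence embeddings), a local representation-based score (word-to-region similarities aggregated into one value), and a margin-based constraint of the kind used in metric learning. All function names, the use of cosine similarity, the max-then-mean aggregation, and the margin value of 0.2 are assumptions made for illustration; they are not taken from any specific model covered by the survey.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_score(image_emb, text_emb):
    """Global matching: compare one image vector with one sentence vector
    that are assumed to live in a shared embedding space."""
    return cosine_similarity(image_emb, text_emb)


def local_score(region_feats, word_feats):
    """Local matching (illustrative aggregation): for every word, keep the
    similarity of its best-matching region, then average over words."""
    per_word = [
        max(cosine_similarity(word, region) for region in region_feats)
        for word in word_feats
    ]
    return sum(per_word) / len(per_word)


def hinge_ranking_loss(pos_score, neg_score, margin=0.2):
    """Margin constraint: a matched pair should outscore an unmatched
    pair by at least `margin`; otherwise a penalty is incurred."""
    return max(0.0, margin - pos_score + neg_score)


if __name__ == "__main__":
    # Toy 2-D features, chosen by hand purely for demonstration.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    words = [[1.0, 0.0], [-1.0, 0.0]]
    print(global_score([1.0, 0.0], [1.0, 0.0]))
    print(local_score(regions, words))
    print(hinge_ranking_loss(pos_score=0.5, neg_score=0.9))
```

In this sketch the scoring functions and the loss are deliberately decoupled: either `global_score` or `local_score` can supply the positive and negative scores passed to `hinge_ranking_loss`, which mirrors how a metric learning constraint can be applied regardless of the representation granularity.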
| Translated title of the contribution | 基于深度学习的图像-文本匹配研究综述 |
|---|---|
| Original language | English |
| Pages (from-to) | 2370-2399 |
| Number of pages | 30 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 46 |
| Issue number | 11 |
| DOIs | |
| State | Published - Nov 2023 |
| Externally published | Yes |
Keywords
- artificial intelligence
- cross-modal image retrieval
- deep learning
- image-text matching
- multimodal pre-training model
- survey
Fingerprint
Dive into the research topics of 'A Survey on Deep Learning Based Image-Text Matching'. Together they form a unique fingerprint.