Abstract
Training an accurate image–text retrieval model requires a large number of image–text pairs. In real-world applications, however, such data is often scarce. Data augmentation is a natural strategy for addressing this small-data problem. Although uni-modal data augmentation has achieved tremendous success, multi-modal data augmentation remains challenging because semantic consistency between the modalities is difficult to preserve. In this paper, we propose a Multi-modal Data Augmentation framework based on Masked Modeling (MDAMM), which exploits inter-modal correlations to reconstruct the masked parts from the intact modality and the remaining information in the masked modality. Furthermore, we design a novel metric to quantitatively measure the consistency and diversity of the augmented multi-modal data. We then use this metric to filter the augmented data and thereby ensure the quality of the samples that are retained. Experimental results on three real-world datasets demonstrate the superiority of the proposed method over traditional uni-modal and other competing multi-modal data augmentation methods.
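
The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the metric, so everything below is an assumption made for illustration: the function names (`cosine`, `quality_score`, `filter_augmented`), the use of cosine similarity over precomputed embeddings, the weight `alpha`, and the `threshold` are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def quality_score(orig_img, orig_txt, aug_img, aug_txt, alpha=0.5):
    """Hypothetical score, NOT the paper's metric.

    consistency: how well the augmented image and text embeddings agree.
    diversity:   how far the augmented pair moved from the original pair.
    """
    consistency = cosine(aug_img, aug_txt)
    diversity = 1.0 - 0.5 * (cosine(orig_img, aug_img) + cosine(orig_txt, aug_txt))
    return alpha * consistency + (1.0 - alpha) * diversity


def filter_augmented(original, candidates, threshold=0.5):
    """Keep augmented (image, text) embedding pairs scoring at or above threshold."""
    orig_img, orig_txt = original
    return [
        (img, txt)
        for img, txt in candidates
        if quality_score(orig_img, orig_txt, img, txt) >= threshold
    ]


# Toy 2-D embeddings, assumed to share one joint embedding space.
original = ([1.0, 0.0], [1.0, 0.0])
consistent_and_new = ([1.0, 1.0], [1.0, 1.0])   # modalities agree, pair has shifted
inconsistent = ([1.0, 0.0], [0.0, 1.0])         # modalities disagree
kept = filter_augmented(original, [consistent_and_new, inconsistent])
```

In this toy setting only the first candidate is kept: its two modalities still agree with each other while differing from the original pair, which is the trade-off between consistency and diversity that such a filter is meant to capture.
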
| Original language | English |
|---|---|
| Article number | 113821 |
| Journal | Knowledge-Based Systems |
| Volume | 324 |
| State | Published - 3 Aug 2025 |
| Externally published | Yes |
Keywords
- Masked modeling
- Multi-modal data augmentation
- Multi-modal systems
- Multimedia retrieval
Fingerprint
Dive into the research topics of 'Multi-modal data augmentation based on masked modeling for image–text retrieval'. Together they form a unique fingerprint.