Multi-modal data augmentation based on masked modeling for image–text retrieval

  • School of Computer Science and Technology, Harbin Institute of Technology
  • Ningbo University

Research output: Contribution to journal, article, peer-reviewed

Abstract

Training an accurate image–text retrieval model requires a large number of image–text pairs. In real-world applications, however, such data are often scarce. Data augmentation is a natural way to address this small-data problem. Although uni-modal data augmentation has achieved tremendous success, multi-modal data augmentation remains challenging because semantic consistency between the modalities is difficult to preserve. In this paper, we propose a Multi-modal Data Augmentation framework based on Masked Modeling (MDAMM), which exploits inter-modal correlations to reconstruct the masked parts of one modality from the intact modality and the remaining unmasked information in the masked modality. Furthermore, we design a novel metric to quantitatively measure the consistency and diversity of the augmented multi-modal data, and we use this metric to filter the augmented data so that the retained samples are of high quality. Experimental results on three real-world datasets demonstrate the superiority of the proposed method over traditional uni-modal and other competing multi-modal data augmentation methods.
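
The mask-then-filter idea in the abstract can be illustrated with a minimal sketch. This is not the MDAMM implementation: the abstract does not specify the masking scheme, the reconstruction model, or the metric. The names `mask_tokens` and `keep_augmented`, the thresholds, and the use of cosine similarity as a stand-in for consistency and diversity are all assumptions made for illustration. The reconstruction step itself needs a trained cross-modal model and is omitted.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and the sorted masked positions.
    """
    if not tokens:
        return [], []
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    chosen = set(positions)
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    return masked, positions


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_augmented(image_emb, orig_text_emb, aug_text_emb,
                   min_consistency=0.5, min_diversity=0.05):
    """Assumed filter: keep an augmented caption only if it still matches the
    image (consistency) and differs from the original caption (diversity)."""
    consistency = cosine(image_emb, aug_text_emb)
    diversity = 1.0 - cosine(orig_text_emb, aug_text_emb)
    return consistency >= min_consistency and diversity >= min_diversity


rng = random.Random(0)
tokens = "a dog runs across the grass".split()
masked, positions = mask_tokens(tokens, ratio=0.3, rng=rng)
```

In a full pipeline, a model conditioned on the intact image and the unmasked tokens would fill the `[MASK]` positions, and only reconstructions passing a filter such as `keep_augmented` would be added to the training set.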

Original language: English
Article number: 113821
Journal: Knowledge-Based Systems
Volume: 324
State: Published - 3 Aug 2025
Externally published: Yes

Keywords

  • Masked modeling
  • Multi-modal data augmentation
  • Multi-modal systems
  • Multimedia retrieval
