VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

  • Chun Mei Feng
  • Yang Bai*
  • Tao Luo
  • Zhen Li
  • Salman Khan
  • Wangmeng Zuo
  • Rick Siow Mong Goh
  • Yong Liu
  • *Corresponding author for this work
  • Agency for Science, Technology and Research, Singapore
  • The Chinese University of Hong Kong, Shenzhen
  • Mohamed Bin Zayed University of Artificial Intelligence
  • Australian National University

Research output: Contribution to journal, Conference article, peer-review

Abstract

Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are inconsistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach that can be plugged directly into existing CIR methods. Given the top-C images retrieved by a CIR method, VQA4CIR aims to reduce the adverse effect of failed retrieval results that are inconsistent with the relative caption. To find the retrieved images that are inconsistent with the relative caption, we resort to a "QA generation → VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several question-answer pairs from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding a retrieved image and a question to the VQA model, one can identify an image as inconsistent with the relative caption when the answer given by the VQA model differs from the answer in the QA pair. Consequently, CIR performance can be boosted by modifying the ranks of the inconsistent retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
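
The post-processing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `rerank_top_c`, `is_consistent`, and `normalize` are hypothetical, the `vqa` callable stands in for the fine-tuned LVLM, answers are compared by simple normalized string matching, and "modifying the ranks" is assumed here to mean moving inconsistent images to the end of the top-C list while preserving the original order otherwise. The abstract does not specify any of these details.

```python
from typing import Callable, List, Sequence, Tuple

# (question, expected answer) pair generated from the relative caption.
QAPair = Tuple[str, str]
# Hypothetical VQA interface: (image identifier, question) -> answer text.
VQAModel = Callable[[str, str], str]


def normalize(answer: str) -> str:
    """Lower-case and strip whitespace and trailing periods before comparing."""
    return answer.strip().lower().rstrip(".")


def is_consistent(image: str, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> bool:
    """An image is consistent if the VQA answer matches every expected answer.

    Assumption: exact match after normalization; a real system may be softer.
    """
    return all(normalize(vqa(image, q)) == normalize(a) for q, a in qa_pairs)


def rerank_top_c(
    ranked: Sequence[str],
    qa_pairs: Sequence[QAPair],
    vqa: VQAModel,
    top_c: int,
) -> List[str]:
    """Demote inconsistent images within the top-C results.

    Assumption: inconsistent images move behind the consistent ones inside
    the top-C window; images beyond rank C are left untouched.
    """
    head, tail = list(ranked[:top_c]), list(ranked[top_c:])
    flags = [is_consistent(img, qa_pairs, vqa) for img in head]
    kept = [img for img, ok in zip(head, flags) if ok]
    demoted = [img for img, ok in zip(head, flags) if not ok]
    return kept + demoted + tail
```

In this sketch the VQA model is called once per image and question, so the extra cost grows with C times the number of generated QA pairs; only the top-C window is verified, which matches the abstract's framing of VQA4CIR as a post-processing step on the top-C retrieved images.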

Original language: English
Pages (from-to): 2942-2950
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 3
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 to 4 Mar 2025
