Combating Visual Question Answering Hallucinations via Robust Multi-Space Co-Debias Learning

  • Jiawei Zhu
  • Yishu Liu
  • Huanjia Zhu
  • Hui Lin*
  • Yuncheng Jiang
  • Zheng Zhang
  • Bingzhi Chen*
  • * Corresponding author for this work
  • Beijing Institute of Technology
  • Harbin Institute of Technology, Shenzhen
  • South China Normal University
  • China Academy of Electronics and Information Technology

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

The challenge of bias in visual question answering (VQA) has gained considerable attention in contemporary research. Various intricate bias dependencies, such as modalities and data imbalances, can cause semantic ambiguities to generate shifts in the feature space of VQA instances. This phenomenon is referred to as "VQA Hallucinations". Such distortions can cause hallucination distributions that deviate significantly from the true data, resulting in the model producing factually incorrect predictions. To address this challenge, we propose a robust Multi-Space Co-debias Learning (MSCD) approach for combating VQA hallucinations, which effectively mitigates bias-induced instance and distribution shifts in multi-space under a unified paradigm. Specifically, we design bias-aware and prior-aware debias constraints by utilizing the angle and angle margin of the spherical space to construct bias-prior-instance constraints, thereby refining the manifold representation of instance de-bias and distribution de-dependence. Moreover, we leverage the inherent overfitting characteristics of Euclidean space to introduce bias components from biased examples and modal counterexample injection, further assisting in multi-space robust learning. By integrating homeomorphic instances in different spaces, MSCD could enhance the comprehension of structural relationships between semantics and answer classes, yielding robust representations that are not solely reliant on training priors. In this way, our co-debias paradigm generates more robust representations that effectively mitigate biases to combat hallucinations. Extensive experiments on multiple benchmark datasets consistently demonstrate that the proposed MSCD method outperforms state-of-the-art baselines.

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 955-964
Number of pages: 10
ISBN (Electronic): 9798400706868
DOIs
State: Published - 28 Oct 2024
Externally published: Yes
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 to 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 to 1/11/24

Keywords

  • multi-space learning
  • robust learning
  • visual question answering
  • vqa hallucinations
