TY - GEN
T1 - Combating Visual Question Answering Hallucinations via Robust Multi-Space Co-Debias Learning
AU - Zhu, Jiawei
AU - Liu, Yishu
AU - Zhu, Huanjia
AU - Lin, Hui
AU - Jiang, Yuncheng
AU - Zhang, Zheng
AU - Chen, Bingzhi
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - The challenge of bias in visual question answering (VQA) has gained considerable attention in contemporary research. Various intricate bias dependencies, such as modalities and data imbalances, can cause semantic ambiguities to generate shifts in the feature space of VQA instances. This phenomenon is referred to as "VQA hallucinations". Such distortions can cause hallucination distributions that deviate significantly from the true data, resulting in the model producing factually incorrect predictions. To address this challenge, we propose a robust Multi-Space Co-debias Learning (MSCD) approach for combating VQA hallucinations, which effectively mitigates bias-induced instance and distribution shifts in multi-space under a unified paradigm. Specifically, we design bias-aware and prior-aware debias constraints by utilizing the angle and angle margin of the spherical space to construct bias-prior-instance constraints, thereby refining the manifold representation of instance de-bias and distribution de-dependence. Moreover, we leverage the inherent overfitting characteristics of Euclidean space to introduce bias components from biased examples and modal counterexample injection, further assisting in multi-space robust learning. By integrating homeomorphic instances in different spaces, MSCD can enhance the comprehension of structural relationships between semantics and answer classes, yielding robust representations that are not solely reliant on training priors. In this way, our co-debias paradigm generates more robust representations that effectively mitigate biases to combat hallucinations. Extensive experiments on multiple benchmark datasets consistently demonstrate that the proposed MSCD method outperforms state-of-the-art baselines.
KW - multi-space learning
KW - robust learning
KW - visual question answering
KW - vqa hallucinations
UR - https://www.scopus.com/pages/publications/85209776784
U2 - 10.1145/3664647.3681663
DO - 10.1145/3664647.3681663
M3 - Conference contribution
AN - SCOPUS:85209776784
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 955
EP - 964
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 32nd ACM International Conference on Multimedia, MM 2024
Y2 - 28 October 2024 through 1 November 2024
ER -