TY - GEN
T1 - Focal and Composed Vision-semantic Modeling for Visual Question Answering
AU - Han, Yudong
AU - Guo, Yangyang
AU - Yin, Jianhua
AU - Liu, Meng
AU - Hu, Yupeng
AU - Nie, Liqiang
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/10/17
Y1 - 2021/10/17
AB - Visual Question Answering (VQA) is a vital yet challenging task in the field of multimedia comprehension. To correctly answer questions about an image, a VQA model must sufficiently understand the visual scene, especially the vision-semantic reasoning between the two modalities. Traditional relation-based methods encode the pairwise relations of objects to boost VQA model performance. However, this simple strategy fails to exploit the abundant concepts expressed by the composition of diverse image objects, leading to sub-optimal performance. In this paper, we propose a focal and composed vision-semantic modeling method, a trainable end-to-end model, for better vision-semantic redundancy removal and compositionality modeling. Concretely, we first introduce the LENA cell, a plug-and-play reasoning module that first removes redundant semantics via a focal mechanism and then models vision-semantic compositionality for better visual reasoning. We then incorporate the cell into a full LENA network, which progressively refines multimodal composed representations and can be leveraged to infer high-order vision-semantics in a multi-step manner. Extensive experiments on two benchmark datasets, i.e., VQA v2 and VQA-CP v2, verify the superiority of our model over several state-of-the-art baselines.
KW - vision-semantic compositionality
KW - vision-semantic redundancy
KW - visual question answering
UR - https://www.scopus.com/pages/publications/85119349560
U2 - 10.1145/3474085.3475609
DO - 10.1145/3474085.3475609
M3 - Conference contribution
AN - SCOPUS:85119349560
T3 - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
SP - 4528
EP - 4536
BT - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 29th ACM International Conference on Multimedia, MM 2021
Y2 - 20 October 2021 through 24 October 2021
ER -