TY - GEN
T1 - Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
AU - Li, Zhenyang
AU - Guo, Yangyang
AU - Wang, Kejie
AU - Chen, Xiaolin
AU - Nie, Liqiang
AU - Kankanhalli, Mohan
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/10/27
Y1 - 2023/10/27
AB - Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for the predicted answers. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on some generic large-scale vision-text datasets, and then the learned representations are transferred to the downstream VCR task. Despite their attractive performance, this paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the important object-tag correlation. With these findings, we tentatively suggest some future directions from the aspect of dataset, evaluation metric, and training tricks. We believe this work could make researchers revisit the intuition and goals of VCR, and thus help tackle the remaining challenges in visual reasoning.
KW - vision-language transformer
KW - visual commonsense reasoning
KW - visual question answering
UR - https://www.scopus.com/pages/publications/85179559101
U2 - 10.1145/3581783.3612395
DO - 10.1145/3581783.3612395
M3 - Conference contribution
AN - SCOPUS:85179559101
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 5634
EP - 5644
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 31st ACM International Conference on Multimedia, MM 2023
Y2 - 29 October 2023 through 3 November 2023
ER -