Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR

  • Zhenyang Li
  • Yangyang Guo
  • Kejie Wang
  • Xiaolin Chen
  • Liqiang Nie*
  • Mohan Kankanhalli

*Corresponding author for this work

  • Shandong University
  • National University of Singapore
  • Harbin Institute of Technology Shenzhen

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for the predicted answers. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on generic large-scale vision-text datasets, and then the learned representations are transferred to the downstream VCR task. Despite their attractive performance, this paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the important object-tag correlation. With these findings, we tentatively suggest some future directions from the aspects of datasets, evaluation metrics, and training tricks. We believe this work could make researchers revisit the intuition and goals of VCR, and thus help tackle the remaining challenges in visual reasoning.

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 5634-5644
Number of pages: 11
ISBN (Electronic): 9798400701085
State: Published - 27 Oct 2023
Externally published: Yes
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 to 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 to 3/11/23

Keywords

  • vision-language transformer
  • visual commonsense reasoning
  • visual question answering
