VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

  • Faculty of Computing, Harbin Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Achieving the optimal form of Visual Question Answering mandates a profound grasp of understanding, grounding, and reasoning within the intersecting domains of vision and language. Traditional VQA benchmarks have predominantly focused on simplistic tasks such as counting, visual attributes, and object detection, which do not necessitate intricate cross-modal information understanding and inference. Motivated by the need for a more comprehensive evaluation, we introduce a novel dataset comprising 23,781 questions derived from 10,124 image-text pairs. Specifically, the task of this dataset requires the model to align multimedia representations of the same entity to implement multi-hop reasoning between image and text and finally use natural language to answer the question. Furthermore, we evaluate this VTQA dataset, comparing the performance of both state-of-the-art VQA models and our proposed baseline model, the Key Entity Cross-Media Reasoning Network (KECMRN). The VTQA task poses formidable challenges for traditional VQA models, underscoring its intrinsic complexity. Conversely, KECMRN exhibits a modest improvement, signifying its potential in multimedia entity alignment and multi-step reasoning. Our analysis underscores the diversity, difficulty, and scale of the VTQA task compared to previous multimodal QA datasets. In conclusion, we anticipate that this dataset will serve as a pivotal resource for advancing and evaluating models proficient in multimedia entity alignment, multi-step reasoning, and open-ended answer generation. Our dataset and code are available at https://visual-text-qa.github.io/

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 27208-27217
Number of pages: 10
ISBN (Electronic): 9798350353006
DOIs
State: Published - 2024
Externally published: Yes
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 to 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 to 22/06/24

Keywords

  • artificial intelligence
  • dataset
  • multimodal
  • visual question answering
