Harnessing Representative Spatial-Temporal Information for Video Question Answering

  • Yuanyuan Wang*
  • Meng Liu
  • Xuemeng Song
  • Liqiang Nie

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Video question answering, which aims to answer a natural language question about a given video, has become prevalent in the past few years. Although remarkable improvements have been achieved, the task still faces the challenge of insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances video understanding by summarizing only representative visual information. To extract representative object information, we introduce an adaptive attention mechanism based on uncertainty estimation. In parallel, to capture representative frame-level and clip-level visual information, we iteratively construct a much more compact set of representations in an expectation-maximization manner, discarding noisy information. Both quantitative and qualitative results on the NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
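The expectation-maximization step described in the abstract — iteratively summarizing many frame- or clip-level features into a small set of representative ones — can be sketched as below. This is a minimal illustrative sketch of generic EM-style attention, not the paper's exact formulation: the number of bases `num_bases`, the temperature, the iteration count, and the random initialization are all assumptions made here for the example.

```python
import numpy as np

def em_attention(features, num_bases=4, num_iters=10, temp=1.0, seed=0):
    """Summarize N feature vectors into K compact bases via EM-style
    attention (illustrative sketch; hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    bases = rng.standard_normal((num_bases, d))  # random initial bases
    for _ in range(num_iters):
        # E-step: softly assign each feature vector to the bases.
        logits = features @ bases.T / temp                 # (N, K)
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)            # responsibilities
        # M-step: update each basis as the responsibility-weighted
        # mean of the features assigned to it.
        bases = (resp.T @ features) / (resp.sum(axis=0)[:, None] + 1e-6)
    return bases, resp

# Example: compress 32 noisy frame features into 4 representative bases.
frames = np.random.default_rng(1).standard_normal((32, 8))
bases, resp = em_attention(frames)
```

After the iterations, `bases` plays the role of the compact representation set, and downstream attention can operate on these K vectors instead of all N noisy inputs.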

Original language: English
Article number: 306
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 20
Issue number: 10
DOIs
State: Published - 29 Oct 2024
Externally published: Yes

Keywords

  • Video question answering
  • expectation-maximization attention
  • uncertainty estimation
