
Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering

  • Ting Yu
  • Kunhao Fu
  • Jian Zhang
  • Qingming Huang
  • Jun Yu*
  • *Corresponding author for this work
  • Hangzhou Normal University
  • University of Chinese Academy of Sciences
  • Hangzhou Dianzi University
  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal, Article, peer-review

Abstract

Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task that focuses on semantic understanding of untrimmed long-term videos and diverse free-form questions, while emphasizing comprehensive cross-modal reasoning to yield precise answers. Canonical approaches often rely on off-the-shelf feature extractors to sidestep the expensive computational overhead, but this tends to produce domain-independent, modality-unrelated representations. Furthermore, the inherent gradient blocking between unimodal comprehension and cross-modal interaction hinders reliable answer generation. In contrast, recently emerging video-language pre-training models enable cost-effective end-to-end modeling but fall short in domain-specific ratiocination and exhibit disparities in task formulation. To this end, we present an entirely end-to-end solution for long-term VideoQA: the Multi-granularity Contrastive cross-modal collaborative Generation (MCG) model. To derive discriminative representations possessing high visual concepts, we introduce Joint Unimodal Modeling (JUM) on a clip-bone architecture and leverage Multi-granularity Contrastive Learning (MCL) to harness the intrinsically or explicitly exhibited semantic correspondences. To alleviate the task formulation discrepancy, we propose a Cross-modal Collaborative Generation (CCG) module that reformulates VideoQA as a generative task instead of the conventional classification scheme, giving the model the capability for cross-modal high-semantic fusion and generation so as to rationalize and answer. Extensive experiments on six publicly available VideoQA datasets underscore the superiority of the proposed method.
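
The abstract does not give the form of the contrastive objective, so the following is only an illustrative sketch of the general idea behind contrastive alignment of video and text embeddings. It assumes a generic symmetric InfoNCE loss at a single granularity; the function name `info_nce`, the `temperature` default, and the embedding shapes are hypothetical choices for this example and are not taken from the MCG model.

```python
import numpy as np


def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Assumption: row i of video_emb and row i of text_emb form a positive
    pair; every other row in the batch serves as a negative.
    """
    # L2-normalise rows so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (B, B) similarity matrix, sharpened by the temperature.
    logits = v @ t.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    text = video + 0.05 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned loss:   ", info_nce(video, text))
    print("misaligned loss:", info_nce(video, np.roll(text, 1, axis=0)))
```

In this toy setup the loss is lower when each video embedding sits next to its own text embedding than when the pairs are shifted out of alignment, which is the behaviour a contrastive objective is meant to reward. A multi-granularity variant would presumably apply such a term at more than one level of representation, but the abstract does not say how.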

Original language: English
Pages (from-to): 3115-3129
Number of pages: 15
Journal: IEEE Transactions on Image Processing
Volume: 33
State: Published - 2024
Externally published: Yes

Keywords

  • Video question answering
  • contrastive learning
  • cross-modal collaborative generation
  • end-to-end modeling
  • multi-granularity
