
RTQ: Rethinking Video-language Understanding Based on Image-text Model

  • Xiao Wang
  • Yaoyu Li
  • Tian Gan*
  • Zheng Zhang
  • Jingjing Lv
  • Liqiang Nie
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • JD.com, Inc.
  • Shandong University

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 557-566
Number of pages: 10
ISBN (Electronic): 9798400701085
State: Published - 27 Oct 2023
Externally published: Yes
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 to 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 to 3/11/23

Keywords

  • video caption
  • video question answering
  • video retrieval
