
Toward Long Video Understanding via Fine-Detailed Video Story Generation

  • Zeng You
  • Zhiquan Wen
  • Yaofo Chen
  • Xin Li
  • Runhao Zeng*
  • Yaowei Wang*
  • Mingkui Tan*
  • *Corresponding author for this work
  • South China University of Technology
  • Peng Cheng Laboratory
  • Shenzhen MSU-BIT University
  • Shenzhen University
  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods face two challenges on long videos: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which converts long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information about the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method on eight datasets spanning three tasks. The results demonstrate the effectiveness and versatility of our method.
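
The bottom-up idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the intermediate "event" level, the Jaccard similarity measure, the threshold, and the join-based `merge` placeholder are all assumptions chosen for illustration, and only text-level redundancy removal is shown (the abstract also mentions a visual level).

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_redundant(descriptions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a description only if it differs enough from the last kept one."""
    kept: List[str] = []
    for d in descriptions:
        if not kept or jaccard(kept[-1], d) < threshold:
            kept.append(d)
    return kept


def merge(descriptions: List[str]) -> str:
    """Hypothetical stand-in for a summarizer (e.g. a language model call).

    Here it simply joins the texts so the sketch runs without any model.
    """
    return " ".join(descriptions)


def build_story(clip_captions: List[str], clips_per_event: int = 2) -> Dict[str, object]:
    """Build a clip -> event -> video hierarchy of text, bottom-up."""
    # Level 1: clip descriptions, with near-duplicates removed.
    clips = drop_redundant(clip_captions)
    # Level 2: group neighbouring clips into coarser "event" descriptions.
    events = [
        merge(clips[i:i + clips_per_event])
        for i in range(0, len(clips), clips_per_event)
    ]
    events = drop_redundant(events)
    # Level 3: a single description for the whole video.
    return {"clips": clips, "events": events, "video": merge(events)}


captions = [
    "a man opens a door",
    "a man opens a door",
    "a dog runs outside",
    "the man feeds the dog",
]
story = build_story(captions)
```

The resulting dictionary keeps all three granularities, so a downstream task could query whichever level suits it; a real system would replace `merge` with an actual captioning or summarization model.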

Original language: English
Pages (from-to): 4592-4607
Number of pages: 16
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 5
DOIs
State: Published - 2025
Externally published: Yes

Keywords

  • Foundation models
  • large language models
  • video understanding
