Action-Driven Semantic Representation and Aggregation for Video Captioning

  • Tingting Han*
  • Yaochen Xu
  • Jun Yu
  • Zhou Yu
  • Sicheng Zhao
  • *Corresponding author for this work
  • Hangzhou Dianzi University
  • Tsinghua University

Research output: Contribution to journal › Article › peer-review

Abstract

Video captioning is a challenging task that entails generating natural language descriptions of visual content, yet existing methods often fail to grasp the essence of action semantics. To harness the power of action detection to facilitate a deeper understanding of the video content, we propose an action-driven method, named Hierarchical Semantic Representation and Aggregation (HSRA) network. This method explicitly exploits action clues with a hierarchical semantic representation module, which models visual semantics in a three-level structure: “object-action-event”. By employing learnable action queries, our approach injects extensive action semantics into the model, thereby enabling more accurate and context-rich captions. To further enhance semantic alignment and understanding, we introduce a semantic aggregation component composed of a semantic interaction module and a semantic refinement module. This component facilitates the alignment of semantics across different levels and emphasizes key information, ultimately leading to significant improvements in semantic consistency between the video and the generated captions. We performed extensive evaluations on two well-established public datasets, MSVD and MSR-VTT, and the findings consistently demonstrate that our proposed HSRA network outperforms contemporary state-of-the-art methods.

Original language: English
Pages (from-to): 3383-3395
Number of pages: 13
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 4
DOIs
State: Published - 2025
Externally published: Yes

Keywords

  • Video summarization
  • feature fusion
  • motion attention
  • multi-head attention
  • parameter-free
