Abstract
Video captioning is a challenging task that entails generating natural language descriptions of visual content, yet captioning models often fail to grasp the essence of action semantics. To harness action detection for a deeper understanding of video content, we propose an action-driven method, named the Hierarchical Semantic Representation and Aggregation (HSRA) network. This method explicitly exploits action clues with a hierarchical semantic representation module, which models visual semantics in a three-level structure: “object-action-event”. By employing learnable action queries, our approach injects extensive action semantics into the model, thereby enabling more accurate and context-rich captions. To further enhance semantic alignment and understanding, we introduce a semantic aggregation component composed of a semantic interaction module and a semantic refinement module. This component aligns semantics across the different levels and emphasizes key information, leading to significant improvements in semantic consistency between the video and the generated captions. We performed extensive evaluations on two well-established public datasets, MSVD and MSR-VTT, and the findings consistently demonstrate that our proposed HSRA network outperforms contemporary state-of-the-art methods.
| Original language | English |
|---|---|
| Pages (from-to) | 3383-3395 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 35 |
| Issue number | 4 |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Video summarization
- feature fusion
- motion attention
- multi-head attention
- parameter-free
Fingerprint
Dive into the research topics of 'Action-Driven Semantic Representation and Aggregation for Video Captioning'. Together they form a unique fingerprint.