Abstract
Video summarization has garnered significant attention because it makes video browsing more efficient. Humans can effectively condense long videos into concise summaries by drawing on extensive prior knowledge and using multi-source information to identify the most relevant content. However, existing video summarization approaches fail to incorporate multimodal cues and overlook implicit knowledge, resulting in lower-quality summaries. To address these limitations, we propose a knowledge-guided multimodal network for video summarization, referred to as KGMNet. Specifically, KGMNet employs a visual-audio encoder to extract informative visual and audio representations from the input video. To complement these signals with high-level knowledge, a knowledge-guided encoder integrates affective information from an external knowledge source and expert attributes from image aesthetics, enabling the extraction of implicit knowledge that is not directly available from raw data. Based on these representations, a fine-to-coarse space projection module captures inter-relations among the different modality spaces, strengthening cross-modal consistency. Moreover, a prediction head refines temporal structure by jointly estimating an importance score, a boundary descriptor, and a centrality measure for each frame, supporting smoother transitions and more accurate localization of salient events. Experimental results on two benchmark datasets show that the proposed network outperforms state-of-the-art methods, attaining an F-score of 60.4% on SumMe and 69.9% on TVSum. Furthermore, ablation studies validate the positive contribution of each module within KGMNet.
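For readers unfamiliar with the reported metric, the following is a minimal sketch of an overlap-based F-score between a predicted summary and a reference summary, each given as a binary per-frame selection mask. The abstract does not state the exact evaluation protocol, so the function name `summary_f_score`, the mask representation, and the absence of any multi-annotator aggregation are assumptions made for illustration only, not the authors' evaluation code.

```python
def summary_f_score(pred, truth):
    """Overlap-based F-score between two binary frame-selection masks.

    Illustrative sketch only: `pred` and `truth` are hypothetical
    per-frame masks (truthy = frame is in the summary).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")

    # Frames selected by both the predicted and the reference summary.
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)

    # With no overlap (or an empty summary) the score is defined as 0 here.
    if overlap == 0 or n_pred == 0 or n_truth == 0:
        return 0.0

    precision = overlap / n_pred   # share of predicted frames that are correct
    recall = overlap / n_truth     # share of reference frames that are recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(summary_f_score(predicted, reference))
```

A score reported as a percentage, such as those in the abstract, would correspond to this value multiplied by 100 and averaged over the videos of a dataset under whatever protocol the paper actually uses.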
| Original language | English |
|---|---|
| Article number | 132566 |
| Journal | Expert Systems with Applications |
| Volume | 324 |
| DOIs | |
| State | Published - 25 Aug 2026 |
| Externally published | Yes |
Keywords
- Implicit knowledge
- Multimodal cues
- Video summarization