
A knowledge-guided multimodal network for video summarization

  • Faculty of Computing, Harbin Institute of Technology
  • Harbin Institute of Technology
  • School of Mechatronics Engineering, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Video summarization has garnered significant attention because it makes video browsing more efficient. Humans can effectively condense long videos into concise summaries by drawing on extensive prior knowledge and using multi-source information to identify the most relevant content. However, existing video summarization approaches fail to incorporate multimodal cues and overlook implicit knowledge, which lowers the quality of the generated summaries. To address these limitations, we propose a knowledge-guided multimodal network for video summarization, referred to as KGMNet. Specifically, KGMNet employs a visual-audio encoder to extract informative visual and audio representations from the input video. To complement these signals with high-level knowledge, a knowledge-guided encoder integrates affective information from an external knowledge source and expert attributes from image aesthetics, enabling the extraction of implicit knowledge that is not directly available from raw data. Based on these representations, a fine-to-coarse space projection module captures inter-relations among the different modality spaces, strengthening cross-modal consistency. Moreover, a prediction head refines temporal structure by jointly estimating an importance score, a boundary descriptor, and a centrality measure for each frame, supporting smoother transitions and more accurate localization of salient events. Experimental results on two benchmark datasets show that the proposed network outperforms state-of-the-art methods: KGMNet attains an F-score of 60.4% on SumMe and 69.9% on TVSum. Furthermore, ablation studies validate the positive contribution of each module within KGMNet.
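
The abstract reports F-scores but does not state the evaluation protocol. As a minimal sketch, assuming the overlap-based F-score commonly used for video summarization (the harmonic mean of precision and recall between the frames selected by the predicted summary and those selected by a reference summary), the metric can be computed as below. The function name `summary_f_score` and the binary-mask input format are illustrative assumptions, not taken from the paper, and how scores are aggregated across multiple annotators is a protocol detail not covered here.

```python
from typing import Sequence


def summary_f_score(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Overlap-based F-score between two binary frame-selection masks.

    Hypothetical helper for illustration: each mask holds one entry per
    frame, non-zero if the frame is included in the summary.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")

    # Frames selected by both summaries.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No shared frames (or an empty summary) gives a score of zero.
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0

    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    ref = [1, 0, 0, 0, 1, 1]
    print(f"F-score: {summary_f_score(pred, ref):.3f}")
```

In this toy example two of the three predicted frames also appear in the reference, so precision and recall are both 2/3 and the F-score is 2/3. A reported value such as 60.4% corresponds to 0.604 on this scale.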

Original language: English
Article number: 132566
Journal: Expert Systems with Applications
Volume: 324
State: Published - 25 Aug 2026
Externally published: Yes

Keywords

  • Implicit knowledge
  • Multimodal cues
  • Video summarization
