Modality-aware contrast and fusion for multi-modal summarization

  • Lixin Dai
  • Tingting Han*
  • Zhou Yu
  • Jun Yu
  • Min Tan
  • Yang Liu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal Summarization with Multi-modal Output (MSMO) is an emerging field that aims to generate reliable, high-quality summaries by integrating multiple media types, such as text and video. Current methods primarily focus on fusing features from different modalities but often overlook further enhancement and optimization of the fused features. This limitation can reduce the representational capacity of the fusion and ultimately diminish overall performance. To address this limitation, a novel Modality-aware Contrast and Fusion (MCF) network is proposed. The network uses contrastive learning to preserve the integrity of modality-specific semantics while promoting the complementary integration of different media types. It has three modules. The Multi-Modal Attention (MMA) module captures temporal dependencies and learns discriminative semantics for each media type through uni-modal semantic attention, and aligns and integrates semantics from multiple sources through cross-modal semantic attention. The Uni-Cross Contrastive Learning (UCC) module minimizes modality-aware contrastive losses to make the semantic representations more distinctive. The Modality-Aware Fusion (MAF) module dynamically adjusts the contributions of the uni-modal and cross-modal outputs during summarization, weighting the integration according to the strengths of each modality. Extensive validation on the Bliss, Daily Mail, and CNN datasets demonstrates the state-of-the-art performance of the MCF network and confirms the effectiveness of its components.
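
The abstract names a contrastive objective and an adaptive fusion step but gives no formulas. As a rough illustration of those two general ideas only, the sketch below pairs a generic InfoNCE-style contrastive loss with a single scalar sigmoid gate that blends a uni-modal and a cross-modal feature vector. Every name, the temperature value, and the gate form are assumptions made for illustration; none of this is taken from the paper, and it should not be read as the authors' UCC or MAF definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (assumed form, not the paper's UCC loss).

    The loss is small when the anchor is more similar to the positive
    than to any of the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def gated_fusion(uni, cross, gate_weights, gate_bias=0.0):
    """Blend two feature vectors with one scalar sigmoid gate (hypothetical).

    The gate is computed from the concatenation [uni; cross]; a gate near 1
    favours the uni-modal features, a gate near 0 the cross-modal ones.
    """
    score = sum(w * x for w, x in zip(gate_weights, uni + cross)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-score))
    fused = [gate * u + (1.0 - gate) * c for u, c in zip(uni, cross)]
    return fused, gate
```

In this sketch, a matched pair such as `info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])` yields a much smaller loss than a mismatched one, and `gated_fusion` with all-zero gate weights falls back to a plain average of the two inputs. A learned gate would replace the fixed weights with trained parameters.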

Original language: English
Article number: 130094
Journal: Neurocomputing
Volume: 639
State: Published - 28 Jul 2025
Externally published: Yes

Keywords

  • Contrastive learning
  • Cross-modal fusion
  • Multi-modal summarization
