Abstract
Multi-modal Summarization with Multi-modal Output (MSMO) is an emerging field that aims to generate reliable, high-quality summaries by integrating several media types, such as text and video. Current methods concentrate on fusing features from different modalities but often neglect to further enhance and optimize the fused features. This limits the representational capacity of the fusion and, in turn, degrades overall performance. To address this limitation, a novel Modality-aware Contrast and Fusion (MCF) network is proposed. The network uses contrastive learning to preserve modality-specific semantics while promoting complementary integration across media types. Its Multi-Modal Attention (MMA) module captures temporal dependencies and learns discriminative semantics for each media type through uni-modal semantic attention, and aligns and integrates semantics from multiple sources through cross-modal semantic attention. The Uni-Cross Contrastive Learning (UCC) module minimizes modality-aware contrastive losses to make the semantic representations more distinctive. The Modality-Aware Fusion (MAF) module dynamically adjusts the contributions of the uni-modal and cross-modal outputs during summarization, weighting each according to the strengths of its modality. Extensive validation on the Bliss, Daily Mail, and CNN datasets demonstrates state-of-the-art performance for the MCF network and confirms the effectiveness of each component.
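The abstract names a contrastive objective and a dynamic fusion step but gives no formulas. The sketch below illustrates the two general ideas under loudly stated assumptions: an InfoNCE-style loss stands in for the "modality-aware contrastive losses", and a scalar sigmoid gate stands in for the "dynamic adjustment" of uni-modal and cross-modal outputs. The function names (`info_nce`, `gated_fusion`), the temperature, and the gate parameterization are all hypothetical and are not taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (an assumed stand-in, not the paper's loss).

    anchors[i] is expected to match positives[i]; every other row of
    `positives` acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]  # -log softmax at the positive index
    return total / len(a)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(uni, cross, gate_weights, gate_bias=0.0):
    """Blend uni-modal and cross-modal features with a scalar gate.

    Assumed form: g = sigmoid(w . [uni; cross] + b),
    fused = g * uni + (1 - g) * cross.
    """
    g = sigmoid(dot(gate_weights, uni + cross) + gate_bias)
    fused = [g * u + (1.0 - g) * c for u, c in zip(uni, cross)]
    return fused, g


# Toy usage: aligned pairs should give a lower loss than mismatched pairs.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
video_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(text_feats, video_feats)
mismatched_loss = info_nce(text_feats, video_feats[::-1])

# With all-zero gate weights the gate is neutral and the two inputs are averaged.
fused, gate = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

In this toy setup the loss rewards representations whose matching text and video rows are more similar to each other than to other rows, and the gate lets the blend lean toward whichever branch is more useful. A learned gate would train `gate_weights` and `gate_bias` jointly with the rest of the network.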
| Original language | English |
|---|---|
| Article number | 130094 |
| Journal | Neurocomputing |
| Volume | 639 |
| DOIs | |
| State | Published - 28 Jul 2025 |
| Externally published | Yes |
Keywords
- Contrastive learning
- Cross-modal fusion
- Multi-modal summarization
Fingerprint
Dive into the research topics of 'Modality-aware contrast and fusion for multi-modal summarization'. Together they form a unique fingerprint.