
MDKAT: Multimodal Decoupling With Knowledge Aggregation and Transfer for Video Emotion Recognition

  • Jian Wang
  • Chenglong Wang
  • Lin Guo
  • Shuchang Zhao
  • Dandan Wang
  • Shiqing Zhang*
  • Xiaoming Zhao*
  • Jun Yu
  • Yaowei Wang
  • Yi Yang
  • Siwei Ma
  • Qi Tian

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal Emotion Recognition (MER) leverages multiple input signals to identify the emotions expressed in user-generated data. Because the multimodal inputs in videos are diverse, effectively addressing both modality heterogeneity and homogeneity in MER tasks remains challenging. To address this issue, this work proposes an efficient Multimodal Decoupling Method with Knowledge Aggregation and Transfer (MDKAT) for robust multimodal feature learning in emotional videos. MDKAT consists of three key steps: modality-independent feature extraction, modality-specific feature extraction, and multi-loss integration for decoupling. Across these three steps, four modules are designed, each improving a different aspect of multimodal learning on MER tasks: a Cross-modal Feature Fusion (CFF) module for enhancing modality-independent features, an Adaptive Masked Self-Attention (AMSA) module for feature refinement, a Knowledge Aggregation (KA) module for ensuring the semantic similarity of modality-independent features, and a Knowledge Transfer (KT) module for balancing the strengths of different modalities. Experimental results on the CMU-MOSI and CMU-MOSEI benchmark datasets show that MDKAT outperforms state-of-the-art methods, demonstrating its effectiveness on MER tasks.

Original language: English
Pages (from-to): 9809-9822
Number of pages: 14
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 10
State: Published - 2025
Externally published: Yes

Keywords

  • Multimodal emotion recognition
  • attention
  • decoupling
  • knowledge aggregation
  • knowledge transfer
