MDKAT: Multimodal Decoupling With Knowledge Aggregation and Transfer for Video Emotion Recognition

Abstract
Multimodal Emotion Recognition (MER) leverages multiple input signals to identify the emotions expressed in user-generated data. Effectively handling both modality heterogeneity and modality homogeneity in MER remains challenging due to the diversity of multimodal inputs in videos. To address this issue, this work proposes an efficient Multimodal Decoupling method with Knowledge Aggregation and Transfer (MDKAT) for robust multimodal feature learning on emotional videos. MDKAT consists of three key steps: modality-independent feature extraction, modality-specific feature extraction, and multi-loss integration for decoupling. Across these steps, four modules are designed to improve different aspects of multimodal learning on MER tasks: a Cross-modal Feature Fusion (CFF) module that enhances modality-independent features, an Adaptive Masked Self-Attention (AMSA) module that refines features, a Knowledge Aggregation (KA) module that enforces the semantic similarity of modality-independent features across modalities, and a Knowledge Transfer (KT) module that balances the strengths of different modalities. Experimental results on the widely used CMU-MOSI and CMU-MOSEI datasets show that MDKAT outperforms state-of-the-art methods, demonstrating its effectiveness on MER tasks.
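The abstract does not include implementation details, so the following PyTorch sketch only illustrates the general shared/private decoupling idea it describes: each modality is projected into a common space, a shared encoder yields modality-independent features, per-modality private encoders yield modality-specific features, and auxiliary losses pull shared features together (aggregation) while pushing shared and private features apart (decoupling). All module names, loss choices, dimensions, and weights here are assumptions for illustration, not the paper's actual MDKAT architecture.

```python
# Hypothetical sketch of shared/private multimodal decoupling; not the
# paper's actual MDKAT implementation (CFF, AMSA, KA, KT are not modeled).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecouplingSketch(nn.Module):
    """Toy shared/private feature split for several modalities."""

    def __init__(self, in_dims, hidden=128):
        super().__init__()
        # Project each modality into a common space first.
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in in_dims.items()})
        # One shared encoder produces modality-independent features ...
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # ... and per-modality private encoders produce modality-specific ones.
        self.private = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for m in in_dims}
        )
        self.head = nn.Linear(hidden * 2 * len(in_dims), 1)  # sentiment regression head

    def forward(self, feats):
        shared, private = {}, {}
        for m, x in feats.items():
            h = self.proj[m](x)
            shared[m] = self.shared(h)
            private[m] = self.private[m](h)
        fused = torch.cat([t for m in feats for t in (shared[m], private[m])], dim=-1)
        return self.head(fused), shared, private


def similarity_loss(shared):
    # Aggregation-style term: pull modality-independent features together
    # via pairwise cosine distance between shared features.
    mods = list(shared)
    loss = 0.0
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            loss = loss + (
                1 - F.cosine_similarity(shared[mods[i]], shared[mods[j]], dim=-1)
            ).mean()
    return loss


def difference_loss(shared, private):
    # Decoupling term: push each modality's shared and private features
    # apart by penalizing their squared cosine similarity.
    return sum(
        F.cosine_similarity(shared[m], private[m], dim=-1).pow(2).mean() for m in shared
    )


if __name__ == "__main__":
    dims = {"text": 768, "audio": 74, "video": 35}  # typical CMU-MOSI feature sizes
    model = DecouplingSketch(dims)
    batch = {m: torch.randn(4, d) for m, d in dims.items()}
    pred, sh, pr = model(batch)
    target = torch.randn(4, 1)
    # Multi-loss integration: task loss plus the two auxiliary terms
    # (the 0.3 weights are arbitrary for this demo).
    loss = F.mse_loss(pred, target) + 0.3 * similarity_loss(sh) + 0.3 * difference_loss(sh, pr)
    loss.backward()
    print(pred.shape, float(loss))
```

In this kind of design, the similarity term plays the role the abstract assigns to knowledge aggregation (keeping modality-independent features semantically aligned), while the difference term enforces the decoupling between the two feature streams.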
| Original language | English |
|---|---|
| Pages (from-to) | 9809-9822 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 35 |
| Issue number | 10 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Multimodal emotion recognition
- attention
- decoupling
- knowledge aggregation
- knowledge transfer