Abstract
Due to its wide range of applications, multimodal emotion recognition has gained increasing research attention. Although existing methods have achieved compelling success with various multimodal fusion strategies, they overlook that the dominant modality (e.g., text) may create a shortcut and hence negatively affect the representation learning of the other modalities (e.g., image and audio). To alleviate this problem, we resort to knowledge distillation to narrow the gap between different modalities. In particular, we develop a new hierarchical knowledge distillation model for multi-modal emotion recognition (HKD-MER), consisting of three components: feature extraction, hierarchical knowledge distillation, and attentive multi-modal fusion. As the major contribution of our proposed model, the hierarchical knowledge distillation is designed to transfer knowledge from the dominant modality to the others at both the feature and label levels. It boosts the performance of the non-dominant modalities by modeling the inter-modal relations between different modalities. We have justified the effectiveness of our proposed model on two benchmark datasets.
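The two distillation levels described in the abstract can be illustrated with a minimal NumPy sketch. This is a generic interpretation, not the paper's exact formulation: it assumes feature-level transfer via a mean-squared-error term between teacher (text) and student (image/audio) features, and label-level transfer via a temperature-softened KL divergence between their class distributions; the function names, the weighting scheme `alpha`, and the temperature `T` are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def feature_distillation_loss(f_teacher, f_student):
    """Feature-level KD: pull student features toward teacher features (MSE)."""
    return float(np.mean((f_teacher - f_student) ** 2))

def label_distillation_loss(logits_teacher, logits_student, T=2.0):
    """Label-level KD: KL(teacher || student) on softened distributions."""
    p = softmax(logits_teacher, T)
    q = softmax(logits_student, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl) * T * T)  # T^2 rescaling, as in standard KD

def hierarchical_kd_loss(text_feat, other_feats, text_logits, other_logits,
                         alpha=0.5):
    """Combine both levels, treating text as the dominant (teacher) modality.

    `other_feats` / `other_logits` are lists over the non-dominant
    modalities (e.g., image and audio); `alpha` balances the two levels
    and is a hypothetical hyperparameter.
    """
    feat_loss = sum(feature_distillation_loss(text_feat, f)
                    for f in other_feats)
    label_loss = sum(label_distillation_loss(text_logits, l)
                     for l in other_logits)
    return alpha * feat_loss + (1.0 - alpha) * label_loss
```

In a full model this loss would be added to the task loss and backpropagated through the student encoders only, so that the non-dominant branches are regularized toward the dominant one without degrading it.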
| Original language | English |
|---|---|
| Pages (from-to) | 9036-9046 |
| Number of pages | 11 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 26 |
| DOIs | |
| State | Published - 2024 |
| Externally published | Yes |
Keywords
- Contrastive learning
- emotion recognition
- knowledge distillation
- multi-modal representation learning
Fingerprint
Research topics of 'Muti-Modal Emotion Recognition via Hierarchical Knowledge Distillation'.