TY - GEN
T1 - ESED
T2 - 34th ACM International Conference on Information and Knowledge Management, CIKM 2025
AU - Xiong, Zechang
AU - Ji, Zhenyan
AU - Kong, Wenkang
AU - Dai, Jiuqian
AU - Yin, Shen
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/11/10
Y1 - 2025/11/10
N2 - Multimodal emotion recognition in conversations is inherently challenging due to ambiguous cues, modality conflicts, and temporal dynamics, all of which contribute to complex and diverse uncertainty sources. While some recent methods incorporate uncertainty modeling, they often focus on overall prediction confidence, without explicitly distinguishing the different sources of uncertainty introduced by underlying factors. To address these challenges, we propose a novel Emotion-Specific Evidence Decomposition framework (ESED) that leverages evidential deep learning to explicitly model and disentangle multimodal emotional uncertainty. Rather than directly fusing features, ESED decomposes each modality's evidence into three interpretable components: (1) emotion-consistent evidence, capturing shared emotional cues across modalities; (2) emotion-specific evidence, highlighting the unique emotional role of each modality; and (3) dynamic evidence, modeling utterance-level temporal variations. These components are adaptively weighted based on emotional intensity, ambiguity, and dynamicity, quantified via prediction entropy, inter-modal divergence, and temporal variance. The final prediction is obtained through an adaptive fusion of these weighted components. Extensive experiments demonstrate that ESED outperforms the state-of-the-art methods on the MELD and IEMOCAP datasets, demonstrating the effectiveness of our proposed method.
KW - emotion recognition in conversation
KW - evidential deep learning
KW - multimodal fusion
UR - https://www.scopus.com/pages/publications/105023151776
U2 - 10.1145/3746252.3761430
DO - 10.1145/3746252.3761430
M3 - Conference contribution
AN - SCOPUS:105023151776
T3 - CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management
SP - 3582
EP - 3591
BT - CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery, Inc
Y2 - 10 November 2025 through 14 November 2025
ER -