TY - GEN
T1 - Multi-modal Emotion Recognition Based on Deep Learning in Speech, Video and Text
AU - Zhang, Xue
AU - Wang, Ming Jiang
AU - Guo, Xing Da
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/10/23
Y1 - 2020/10/23
N2 - Emotions are a concrete manifestation of human communication, and research on emotion recognition has steadily increased. Recently, researchers have attached great importance to multi-modal emotion recognition, and a great deal of work has been carried out on emotion recognition from speech, video, text and physiological signals. Multimodal emotion recognition fuses complementary information from different modalities, thereby improving the final recognition rate. This paper preprocesses the speech, video and text modalities of the IEMOCAP dataset, uses deep neural networks to extract emotional features, and fuses the information at the feature level. Five emotion classes are considered: angry, excited, sad, neutral and happy. The three-modality emotion recognition model achieves an accuracy of 0.9541 on the training set and 0.68383 on the validation set. Compared with speech-only emotion recognition, accuracy improves by 0.11751.
AB - Emotions are a concrete manifestation of human communication, and research on emotion recognition has steadily increased. Recently, researchers have attached great importance to multi-modal emotion recognition, and a great deal of work has been carried out on emotion recognition from speech, video, text and physiological signals. Multimodal emotion recognition fuses complementary information from different modalities, thereby improving the final recognition rate. This paper preprocesses the speech, video and text modalities of the IEMOCAP dataset, uses deep neural networks to extract emotional features, and fuses the information at the feature level. Five emotion classes are considered: angry, excited, sad, neutral and happy. The three-modality emotion recognition model achieves an accuracy of 0.9541 on the training set and 0.68383 on the validation set. Compared with speech-only emotion recognition, accuracy improves by 0.11751.
KW - Multimodal emotion recognition
KW - deep learning
KW - feature level fusion
UR - https://www.scopus.com/pages/publications/85101068046
U2 - 10.1109/ICSIP49896.2020.9339464
DO - 10.1109/ICSIP49896.2020.9339464
M3 - Conference contribution
AN - SCOPUS:85101068046
T3 - 2020 IEEE 5th International Conference on Signal and Image Processing, ICSIP 2020
SP - 328
EP - 333
BT - 2020 IEEE 5th International Conference on Signal and Image Processing, ICSIP 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE International Conference on Signal and Image Processing, ICSIP 2020
Y2 - 23 October 2020 through 25 October 2020
ER -