Abstract
Violence detection is an important task in audio and video analysis, with both theoretical and practical significance. To address the scarcity of violent-audio datasets, we first built our own dataset, named VioAudio. We then propose a CNN-ConvLSTM network model for audio violence detection, which achieves an accuracy of 91.5% on VioAudio and a mean average precision (MAP) of 16.47% on the MediaEval 2015 dataset. To overcome the limitation of relying on a single modality, we further integrate self-attention mechanisms and visual information into the CNN-ConvLSTM network and validate both on the MediaEval 2015 dataset. The experimental results show that fusing visual and auditory information markedly improves recognition accuracy, reaching a MAP of 31.25%, which is 1.94% higher than the best previous result. The proposed method considerably improves the accuracy of violence detection and offers a fresh perspective on integrating multimodal information to identify violence.
| Original language | English |
|---|---|
| Title of host publication | Man-Machine Speech Communication - 17th National Conference, NCMMSC 2022, Proceedings |
| Editors | Ling Zhenhua, Gao Jianqing, Yu Kai, Jia Jia |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 208-220 |
| Number of pages | 13 |
| ISBN (Print) | 9789819924004 |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 17th National Conference on Man-Machine Speech Communication, NCMMSC 2022, Hefei, China. Duration: 15 Dec 2022 → 18 Dec 2022 |
Publication series
| Name | Communications in Computer and Information Science |
|---|---|
| Volume | 1765 CCIS |
| ISSN (Print) | 1865-0929 |
| ISSN (Electronic) | 1865-0937 |
Conference
| Conference | 17th National Conference on Man-Machine Speech Communication, NCMMSC 2022 |
|---|---|
| Country/Territory | China |
| City | Hefei |
| Period | 15/12/22 → 18/12/22 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 16 Peace, Justice and Strong Institutions
Keywords
- Auditory and Visual Information Fusion
- Convolution Neural Network
- Long-Short Term Memory Network
- Violence Detection
Fingerprint
Dive into the research topics of 'Violence Detection Through Fusing Visual Information to Auditory Scene'. Together they form a unique fingerprint.