
Violence Detection Through Fusing Visual Information to Auditory Scene

  • Faculty of Computing, Harbin Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Violence detection is a crucial task in audio and video analysis, with significant theoretical and practical implications. To address the lack of violent audio datasets, we first created our own violent audio dataset, named VioAudio. We then proposed a CNN-ConvLSTM network model for audio violence detection, which achieved an accuracy of 91.5% on VioAudio and a mean average precision (MAP) of 16.47% on the MediaEval 2015 dataset. To overcome the limitation of relying on a single modality in violence detection, this paper also integrated self-attention mechanisms and visual information into the CNN-ConvLSTM network and validated both on the MediaEval 2015 dataset. The experimental results demonstrate that, after fusing visual and auditory information, the CNN-LSTM network model substantially improved recognition accuracy, attaining a MAP of 31.25%, which is 1.94% higher than the previous best result. The proposed method considerably increased the accuracy of violence detection and offers fresh perspectives on how to integrate multimodal information to identify violence.
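As a reading aid for the MAP figures quoted above, the following is a minimal sketch of how mean average precision is commonly computed from ranked detection scores. It is not the authors' evaluation code and not the official MediaEval 2015 scoring tool. The function names, the toy scores and labels, and the choice to average over per-video lists are all assumptions made for illustration.

```python
def average_precision(scores, labels):
    """Average precision for one ranked list; labels are 1 (violent) or 0."""
    # Rank items from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank of each relevant (violent) item.
            precision_sum += hits / rank
    # With no relevant items, define AP as 0.0 (an assumption of this sketch).
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the per-list average precision over (scores, labels) pairs."""
    return sum(average_precision(s, l) for s, l in runs) / len(runs)


# Hypothetical toy data: two videos, each with per-segment scores and labels.
toy_runs = [
    ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    ([0.7, 0.6, 0.2], [0, 1, 0]),
]
map_value = mean_average_precision(toy_runs)
```

Under this definition, a MAP of 31.25% would mean that, averaged over the evaluated lists, violent segments tend to be ranked above non-violent ones noticeably more often than at 16.47%. The exact grouping used in the paper's evaluation is not stated in the abstract.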

Original language: English
Title of host publication: Man-Machine Speech Communication - 17th National Conference, NCMMSC 2022, Proceedings
Editors: Ling Zhenhua, Gao Jianqing, Yu Kai, Jia Jia
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 208-220
Number of pages: 13
ISBN (Print): 9789819924004
State: Published - 2023
Externally published: Yes
Event: 17th National Conference on Man-Machine Speech Communication, NCMMSC 2022 - Hefei, China
Duration: 15 Dec 2022 to 18 Dec 2022

Publication series

Name: Communications in Computer and Information Science
Volume: 1765 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 17th National Conference on Man-Machine Speech Communication, NCMMSC 2022
Country/Territory: China
City: Hefei
Period: 15/12/22 to 18/12/22

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 16 - Peace, Justice and Strong Institutions

Keywords

  • Auditory and Visual Information Fusion
  • Convolution Neural Network
  • Long-Short Term Memory Network
  • Violence Detection
