
AVMSN: An Audio-Visual Two Stream Crowd Counting Framework under Low-Quality Conditions

  • Ruihan Hu
  • Qinglong Mo*
  • Yuanfei Xie
  • Yongqian Xu
  • Jiaqi Chen
  • Yalun Yang
  • Hongjian Zhou
  • Zhi Ri Tang*
  • Edmond Q. Wu*
  • *Corresponding author for this work
  • Institute of Intelligent Manufacturing, Guangdong Academy of Sciences
  • Shihezi University
  • Wuhan University
  • Shanghai Jiao Tong University
  • Wuhan University of Science and Technology
  • Wuhan Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Crowd counting is an essential computer vision application in which a convolutional neural network models crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. Vision and audio are both widely used media through which people perceive physical changes in the world, so cross-modal information offers an alternative way to solve the crowd counting task. This paper proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources for crowd counting. The AVMSN consists of a Feature extraction module and a Multi-modal fusion module. To handle objects of various sizes in the crowd scene, the Feature extraction module adopts Sample Convolutional Blocks as its multi-scale Vision-end branch to calculate the weighted-visual feature. The audio signal is transformed from the temporal domain into a spectrogram, and the audio feature is learned by the audio-VGG network. Finally, the weighted-visual and audio features are fused by the Multi-modal fusion module, which adopts a cascade fusion architecture to calculate the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
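Two steps mentioned in the abstract can be illustrated with a minimal NumPy sketch: turning a time-domain audio signal into a magnitude spectrogram, and scoring density-map predictions by mean absolute error (MAE) on the crowd count. This is not the authors' implementation. The function names, the Hann window, and the frame length and hop size are assumptions chosen for illustration; the only conventions relied on are that a density map sums to the estimated count and that MAE averages the absolute count errors.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier magnitude of a 1-D signal.

    Returns an array of shape (frame_len // 2 + 1, n_frames):
    rows are frequency bins, columns are time frames.
    frame_len and hop are illustrative defaults, not values from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)  # assumed window choice
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def count_from_density(density_map):
    """Estimated crowd count is the sum over the density map."""
    return float(np.sum(density_map))

def mean_absolute_error(pred_counts, true_counts):
    """Mean of absolute differences between predicted and true counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    return float(np.mean(np.abs(pred - true)))

if __name__ == "__main__":
    # Synthetic 1-second tone at 1 kHz, sampled at 8 kHz (made-up test data).
    t = np.arange(8000) / 8000.0
    spec = magnitude_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
    print("spectrogram shape:", spec.shape)

    # Toy density maps standing in for network outputs.
    preds = [count_from_density(np.full((4, 5), 0.5)),
             count_from_density(np.full((4, 5), 1.0))]
    print("MAE:", mean_absolute_error(preds, [12, 17]))
```

In a full audio-visual pipeline, a spectrogram like this would be the input to the audio network, and the MAE would be computed over the counts obtained from the fused model's estimated density maps.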

Original language: English
Article number: 9416332
Pages (from-to): 80500-80510
Number of pages: 11
Journal: IEEE Access
Volume: 9
State: Published - 2021
Externally published: Yes

Keywords

  • Multi-scale architecture
  • audio-visual model
  • cascade fusion
  • crowd counting
