
Learning Temporal Relations from Semantic Neighbors for Acoustic Scene Classification

  • Liwen Zhang
  • Jiqing Han*
  • Ziqiang Shi
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Fujitsu

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Convolutional networks have achieved state-of-the-art performance on Acoustic Scene Classification (ASC). Given the Log-Mel spectrogram of an audio sample, a network can extract useful semantic content within a receptive field of limited range by stacking local convolutional operations. However, the temporal relations between different receptive fields are not captured explicitly. In this letter, we propose an end-to-end 3D Convolutional Neural Network (CNN) for ASC, named SeNoT-Net, which generates effective audio representations by capturing temporal relations from the semantic neighbors of different receptive fields over time. SeNoT-Net treats the Log-Mel spectrogram as an ordered segment-level sequence. For each segment, a residual block produces semantic feature maps; the semantic neighbors over time (SeNoT) module is then applied to capture the relations between each feature point in the feature maps and its top-k semantic neighbors. The proposed SeNoT-Net outperforms most state-of-the-art CNN models on both the DCASE 2018 and 2019 ASC datasets.
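The idea of relating each feature point to its top-k semantic neighbors in other segments can be illustrated with a small sketch. This is not the SeNoT module as published: the abstract does not specify the similarity measure or how relations are aggregated, so cosine similarity and a mean-offset aggregation are assumptions made here, and the names `cosine`, `top_k_semantic_neighbors`, and `relate` are hypothetical. Each segment is represented as a plain list of feature vectors standing in for the feature maps a residual block would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_semantic_neighbors(segments, t, i, k):
    """Find the k feature points in *other* segments most similar to point i of segment t.

    segments: list (ordered in time) of segments; each segment is a list of
              feature vectors, one per feature point.
    Returns a list of (similarity, segment_index, point_index), best first.
    """
    query = segments[t][i]
    candidates = []
    for s, points in enumerate(segments):
        if s == t:
            continue  # neighbors are searched over time, i.e. in other segments
        for j, point in enumerate(points):
            candidates.append((cosine(query, point), s, j))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k]


def relate(segments, t, i, k):
    """Add the mean offset to the top-k neighbors onto the query feature.

    The mean-offset aggregation is an illustrative assumption, not the
    aggregation used by the paper.
    """
    query = segments[t][i]
    neighbors = top_k_semantic_neighbors(segments, t, i, k)
    if not neighbors:
        return list(query)
    offset = [0.0] * len(query)
    for _, s, j in neighbors:
        for d, value in enumerate(segments[s][j]):
            offset[d] += value - query[d]
    return [q + o / len(neighbors) for q, o in zip(query, offset)]


# Toy input: three time segments, two 2-D feature points each.
segments = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.0, 2.0]],
    [[-1.0, 0.0], [2.0, 0.0]],
]
neighbors = top_k_semantic_neighbors(segments, t=0, i=0, k=2)
updated = relate(segments, t=0, i=0, k=2)
```

In this toy input, the two nearest semantic neighbors of the first point lie in different segments, which is the kind of cross-time relation that stacked local convolutions alone do not model explicitly.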

Original language: English
Article number: 9097422
Pages (from-to): 950-954
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 27
State: Published - 2020
Externally published: Yes

Keywords

  • Acoustic scene classification
  • ResNet
  • end-to-end
  • semantic neighbors
  • temporal relations
