Abstract
Convolutional networks have achieved state-of-the-art performance on Acoustic Scene Classification (ASC). Given the log-Mel spectrogram of an audio sample, a network can extract useful semantic content within a receptive field of a certain range by stacking local convolutional operations. However, the temporal relations between different receptive fields are not captured explicitly. In this letter, we propose an end-to-end 3D Convolutional Neural Network (CNN) for ASC, named SeNoT-Net, which generates effective audio representations by capturing temporal relations from semantic neighbors of different receptive fields over time. SeNoT-Net treats the log-Mel spectrogram as an ordered segment-level sequence. For each segment, a residual block produces semantic feature maps; the semantic neighbors over time (SeNoT) module is then applied to capture the relations between each feature point in the feature maps and its top-k semantic neighbors. The proposed SeNoT-Net outperforms most state-of-the-art CNN models on both the DCASE 2018 and 2019 ASC datasets.
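The segment-and-neighbor idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the choice to draw neighbors from the immediately preceding segment, and the mean-plus-concatenation aggregation are all assumptions made for illustration, and raw spectrogram frames stand in for the feature maps that a residual block would produce.

```python
import numpy as np


def split_into_segments(log_mel, num_segments):
    """Split a (n_mels, n_frames) spectrogram into ordered, equal-length
    segments along time. Trailing frames that do not fill a segment are dropped.
    Returns an array of shape (num_segments, n_mels, seg_len)."""
    n_mels, n_frames = log_mel.shape
    seg_len = n_frames // num_segments
    trimmed = log_mel[:, : seg_len * num_segments]
    return trimmed.reshape(n_mels, num_segments, seg_len).transpose(1, 0, 2)


def topk_semantic_neighbors(query, candidates, k):
    """For each row of query (N, C), find the k rows of candidates (M, C)
    with the highest cosine similarity (an assumed similarity measure).
    Returns neighbor indices (N, k) and their similarities (N, k)."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    sim = q @ c.T  # (N, M) cosine similarities
    idx = np.argsort(-sim, axis=1)[:, :k]  # most similar first
    return idx, np.take_along_axis(sim, idx, axis=1)


def temporal_relation_features(features, k):
    """features: (T, N, C) feature points for T ordered segments.
    For each segment t >= 1, concatenate every feature point with the mean of
    its top-k neighbors in segment t - 1 (an assumed aggregation).
    Returns an array of shape (T - 1, N, 2 * C)."""
    out = []
    for t in range(1, features.shape[0]):
        idx, _ = topk_semantic_neighbors(features[t], features[t - 1], k)
        neighbors = features[t - 1][idx]  # (N, k, C)
        out.append(np.concatenate([features[t], neighbors.mean(axis=1)], axis=1))
    return np.stack(out)


# Usage with synthetic data standing in for a real log-Mel spectrogram.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 500))        # 40 mel bins, 500 frames
segments = split_into_segments(log_mel, 5)      # shape (5, 40, 100)
features = segments.transpose(0, 2, 1)          # frames as feature points
relations = temporal_relation_features(features, k=3)  # shape (4, 100, 80)
```

In a trained network the similarity would be computed on learned feature maps and the aggregation would be learned rather than a fixed mean; the sketch only shows the data flow from ordered segments to per-point top-k neighbor relations.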
| Original language | English |
|---|---|
| Article number | 9097422 |
| Pages (from-to) | 950-954 |
| Number of pages | 5 |
| Journal | IEEE Signal Processing Letters |
| Volume | 27 |
| DOIs | |
| State | Published - 2020 |
| Externally published | Yes |
Keywords
- Acoustic scene classification
- ResNet
- end-to-end
- semantic neighbors
- temporal relations
Fingerprint
Dive into the research topics of 'Learning Temporal Relations from Semantic Neighbors for Acoustic Scene Classification'. Together they form a unique fingerprint.