Abstract
Audio signals are temporally structured data, and learning discriminative representations that retain temporal information is crucial for audio classification. In this article, we propose pyramidal temporal pooling (PTP), an audio representation learning method with a hierarchical pyramid structure that aims to capture the temporal information of an entire audio sample. By stacking a global temporal pooling layer on multiple local temporal pooling layers, PTP captures the high-level temporal dynamics of the input feature sequence in an unsupervised way. Furthermore, in the top global temporal pooling layer, we jointly optimize a learnable discriminative mapping (DM) and a softmax classifier, which yields DM-PTP, a method that jointly learns the discriminative audio representations and the classifier. By treating the temporal encoding as the lower-level constraint of a bi-level optimization problem, DM-PTP produces a discriminative representation while maintaining the temporal information of the whole sequence. For an audio sample of arbitrary duration, both PTP and DM-PTP encode the input feature sequence of arbitrary length into a fixed-length representation. Without using any data augmentation or ensemble learning, both PTP and DM-PTP outperform the state-of-the-art CNNs on the audio event recognition (AER) dataset and achieve performance comparable to the best models in the challenge on the DCASE 2018 acoustic scene classification (ASC) dataset.
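
The pyramid described in the abstract (local temporal pooling layers topped by one global temporal pooling layer, producing a fixed-length vector from a sequence of any length) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the temporal pooling operator, so the sketch substitutes a ridge regression of the frame index on running-mean features, and the function names, window sizes, and strides are all hypothetical.

```python
import numpy as np


def temporal_pool(frames, lam=1e-3):
    """Encode a (T, D) sequence as one D-dimensional vector.

    Assumption: a ridge regression of the frame index on running-mean
    features stands in for the temporal encoder, which the abstract
    does not specify.
    """
    T, D = frames.shape
    t = np.arange(1, T + 1, dtype=float)
    # Running mean of the frames, so each row summarizes the past.
    smoothed = np.cumsum(frames, axis=0) / t[:, None]
    # Solve (S^T S + lam * I) w = S^T t for the regression weights w.
    A = smoothed.T @ smoothed + lam * np.eye(D)
    return np.linalg.solve(A, smoothed.T @ t)


def local_pool(frames, window, stride):
    """Apply temporal_pool to sliding windows, giving a shorter sequence."""
    T = frames.shape[0]
    if T <= window:
        # Sequence shorter than one window: pool it as a single segment.
        return temporal_pool(frames)[None, :]
    starts = range(0, T - window + 1, stride)
    return np.stack([temporal_pool(frames[s:s + window]) for s in starts])


def pyramidal_temporal_pooling(frames, levels=((8, 4), (4, 2))):
    """Stack local pooling layers, then one global pooling layer on top.

    `levels` holds hypothetical (window, stride) pairs, one per local layer.
    """
    seq = frames
    for window, stride in levels:
        seq = local_pool(seq, window, stride)
    return temporal_pool(seq)  # global layer: one fixed-length vector


rng = np.random.default_rng(0)
short_clip = rng.standard_normal((50, 16))   # 50 frames, 16-dim features
long_clip = rng.standard_normal((120, 16))   # 120 frames, same feature size
rep_short = pyramidal_temporal_pooling(short_clip)
rep_long = pyramidal_temporal_pooling(long_clip)
```

Both clips map to a vector whose length equals the feature dimension, regardless of the number of frames. The discriminative mapping and softmax classifier of DM-PTP are not modeled here.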
| Original language | English |
|---|---|
| Article number | 8960462 |
| Pages (from-to) | 770-784 |
| Number of pages | 15 |
| Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing |
| Volume | 28 |
| DOIs | |
| State | Published - 2020 |
| Externally published | Yes |
Keywords
- Audio classification
- bi-level optimization
- convolutional neural network
- discriminative mapping (DM)
- temporal pooling
Article title: Pyramidal Temporal Pooling with Discriminative Mapping for Audio Classification