
Pyramidal Temporal Pooling with Discriminative Mapping for Audio Classification

  • Liwen Zhang
  • Ziqiang Shi
  • Jiqing Han*
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Fujitsu

Research output: Contribution to journal › Article › peer-review

Abstract

Audio signals are temporally structured data, and learning discriminative representations that retain temporal information is crucial for audio classification. In this article, we propose pyramidal temporal pooling (PTP), an audio representation learning method with a hierarchical pyramid structure that aims to capture the temporal information of an entire audio sample. By stacking a global temporal pooling layer on multiple local temporal pooling layers, PTP can capture the high-level temporal dynamics of the input feature sequence in an unsupervised way. Furthermore, in the top global temporal pooling layer, we jointly optimize a learnable discriminative mapping (DM) and a softmax classifier. This yields DM-PTP, a joint learning method for the discriminative audio representations and the classifier. By treating the temporal encoding as a lower-level constraint of a bi-level optimization problem, DM-PTP can produce a discriminative representation while maintaining the temporal information of the whole sequence. For an audio sample of arbitrary duration, both PTP and DM-PTP can encode the input feature sequence, whatever its length, into a fixed-length representation. Without using any data augmentation or ensemble learning methods, both PTP and DM-PTP outperform state-of-the-art CNNs on the audio event recognition (AER) dataset, and achieve performance comparable to the best models in the challenge on the DCASE 2018 acoustic scene classification (ASC) dataset.
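
The two-level structure described in the abstract (local temporal pooling over segments, then one global temporal pooling over the local outputs) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The abstract does not specify the pooling operator, so `temporal_pool` is a hypothetical stand-in (a ridge regression from running-mean features to the frame index), and the `window`, `stride`, and `lam` values are assumptions. The discriminative mapping and softmax classifier of DM-PTP are not modeled.

```python
import numpy as np


def temporal_pool(frames: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Encode a (T, D) sequence as one D-dimensional vector.

    Assumed stand-in operator: ridge-regression weights that map the
    running mean of the frames to their time index. The weights summarize
    how the features evolve over time.
    """
    T, D = frames.shape
    t = np.arange(1, T + 1, dtype=float)
    # Running mean up to each time step smooths the sequence.
    smoothed = np.cumsum(frames, axis=0) / t[:, None]
    # Closed-form ridge solution: (X^T X + lam * I) w = X^T t
    gram = smoothed.T @ smoothed + lam * np.eye(D)
    return np.linalg.solve(gram, smoothed.T @ t)


def pyramidal_temporal_pool(frames: np.ndarray, window: int = 20,
                            stride: int = 10) -> np.ndarray:
    """Two-level pooling: local windows first, then one global pool.

    Trailing frames that do not fill a complete window are ignored
    in this sketch.
    """
    T = frames.shape[0]
    if T <= window:
        starts = [0]  # short input: a single local segment
    else:
        starts = range(0, T - window + 1, stride)
    # Local layer: one D-vector per overlapping segment.
    local = np.stack([temporal_pool(frames[s:s + window]) for s in starts])
    # Global layer: pool the sequence of local encodings.
    return temporal_pool(local)


rng = np.random.default_rng(0)
short_clip = rng.normal(size=(95, 8))   # 95 frames, 8 features (made-up data)
long_clip = rng.normal(size=(300, 8))   # 300 frames, 8 features (made-up data)
print(pyramidal_temporal_pool(short_clip).shape)
print(pyramidal_temporal_pool(long_clip).shape)
```

Both clips map to a vector of the same dimensionality as one input frame, which illustrates the fixed-length property the abstract claims for inputs of arbitrary duration.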

Original language: English
Article number: 8960462
Pages (from-to): 770-784
Number of pages: 15
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 28
State: Published - 2020
Externally published: Yes

Keywords

  • Audio classification
  • bi-level optimization
  • convolutional neural network
  • discriminative mapping (DM)
  • temporal pooling
