
BEATS: Audio Pre-Training with Acoustic Tokenizers

Sanyuan Chen*, Yu Wu*, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, Furu Wei

*Corresponding author for this work

Affiliations: Harbin Institute of Technology; Microsoft USA; Nankai University

Research output: Contribution to journal › Conference article › peer-review

Abstract

We introduce BEATS, a self-supervised learning (SSL) framework for general audio representation pre-training, in which we optimize an acoustic tokenizer and an audio SSL model in iterations. Unlike previous audio SSL models that employ a reconstruction loss for pre-training, our audio SSL model is trained with a discrete label prediction task, where the labels are generated by a semantic-rich acoustic tokenizer. We propose an iterative pipeline to jointly optimize the tokenizer and the pre-trained model, aiming to abstract high-level semantics and discard redundant details of the audio. The experimental results demonstrate that our acoustic tokenizers can generate discrete labels with rich audio semantics, and that our audio SSL models achieve state-of-the-art (SOTA) results across various audio classification benchmarks, even outperforming previous models that use significantly more training data and model parameters. Specifically, we set a new SOTA mAP of 50.6% on AudioSet-2M without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
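The core idea in the abstract — a tokenizer turns continuous audio frames into discrete labels, and the SSL model is trained to predict those labels for masked frames — can be illustrated with a minimal NumPy sketch. This is an illustrative assumption, not the paper's architecture: the nearest-centroid quantizer stands in for the learned acoustic tokenizer, and a linear softmax classifier stands in for the Transformer-based SSL model.

```python
import numpy as np

# Toy sketch of BEATS-style masked discrete-label prediction.
# Shapes, the nearest-centroid tokenizer, and the linear classifier
# are all illustrative assumptions, not the paper's actual components.
rng = np.random.default_rng(0)
num_frames, feat_dim, codebook_size = 32, 16, 8

# --- Acoustic tokenizer (stand-in): assign each continuous frame the
# index of its nearest codeword, yielding one discrete label per frame.
codebook = rng.normal(size=(codebook_size, feat_dim))

def tokenize(frames):
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

frames = rng.normal(size=(num_frames, feat_dim))  # fake audio features
labels = tokenize(frames)

# --- Audio SSL model (stand-in): a linear classifier trained to predict
# the tokenizer's labels on a large fraction of masked frames.
mask = rng.random(num_frames) < 0.75
W = np.zeros((feat_dim, codebook_size))

def masked_ce_loss(W):
    logits = frames[mask] @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(mask.sum()), labels[mask]].mean()

def masked_ce_grad(W):
    X = frames[mask]
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(codebook_size)[labels[mask]]
    return X.T @ (probs - onehot) / mask.sum()

loss_before = masked_ce_loss(W)
for _ in range(100):               # a few SGD steps on the prediction task
    W -= 0.5 * masked_ce_grad(W)
loss_after = masked_ce_loss(W)     # cross-entropy drops as labels are learned
```

In the paper's iterative pipeline, this picture repeats: the tokenizer is refined against the current SSL model's representations, then the SSL model is re-trained on the tokenizer's fresh labels.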

Original language: English
Pages (from-to): 4672-4712
Number of pages: 41
Journal: Proceedings of Machine Learning Research
Volume: 202
State: Published - 2023
Event: 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: 23 Jul 2023 - 29 Jul 2023
