
Adapting Single-Channel Pre-trained Transformer Models for Multi-Channel Sound Event Localization and Detection

  • Faculty of Computing, Harbin Institute of Technology
  • Harbin Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pre-trained transformer audio models have become increasingly important in recent years. However, existing pre-trained transformer audio models are based on single-channel audio and cannot be applied directly to multi-channel audio for Sound Event Localization and Detection (SELD) tasks. To address this issue, we propose SELD-SSAST, a novel model based on the single-channel Self-Supervised Audio Spectrogram Transformer (SSAST). First, we introduce a fusion feature that enables SSAST to learn the features specific to SELD problems effectively. Second, we feed the multi-channel audio features into a single SSAST module, which learns temporal information across channels through channel mixing. Finally, to enable SSAST to learn the relationships between multi-channel audio features, we propose a Convolutional Cross Attention (CCA) module that replaces the Transformer's self-attention, together with an intensity vector (IV) enhanced module that learns the differences between channel features. In our experiments, SELD-SSAST improves performance over the baseline by 23.5% and 20.2% on two datasets, respectively. Additionally, at the same data scale, SELD-SSAST outperforms state-of-the-art (SOTA) models on two datasets.
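The abstract does not describe the internals of the CCA module, so the following is only a generic illustration of the idea its name suggests, not the authors' implementation. It is a minimal sketch that assumes cross attention in which queries come from one channel's token sequence and keys and values come from another channel's, with each projection preceded by a small sliding-window convolution along the token axis. All names (`smooth_tokens`, `conv_cross_attention`, the weight matrices, the kernel) are hypothetical, and plain Python is used only to keep the example self-contained.

```python
import math
import random


def smooth_tokens(x, kernel):
    """Sliding-window weighted sum along the token axis (zero padding, same length).

    x is a list of tokens, each a list of feature values; kernel has odd length.
    """
    half = len(kernel) // 2
    n, d = len(x), len(x[0])
    out = [[0.0] * d for _ in range(n)]
    for t in range(n):
        for j, w in enumerate(kernel):
            s = t + j - half
            if 0 <= s < n:
                for c in range(d):
                    out[t][c] += w * x[s][c]
    return out


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    total = sum(e)
    return [v / total for v in e]


def conv_cross_attention(x_a, x_b, w_q, w_k, w_v, kernel):
    """Queries from channel A attend over keys/values from channel B (assumed layout)."""
    q = matmul(smooth_tokens(x_a, kernel), w_q)
    k = matmul(smooth_tokens(x_b, kernel), w_k)
    v = matmul(smooth_tokens(x_b, kernel), w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale
               for k_row in k] for q_row in q]
    attn = [softmax(r) for r in scores]
    return matmul(attn, v), attn


# Toy data: two channels, 6 tokens each, 4 features per token (arbitrary sizes).
random.seed(0)
TOKENS, DIM = 6, 4
x_a = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS)]
x_b = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS)]
w_q, w_k, w_v = ([[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
                 for _ in range(3))
kernel = [0.25, 0.5, 0.25]

out, attn = conv_cross_attention(x_a, x_b, w_q, w_k, w_v, kernel)
print(len(out), len(out[0]), len(attn), len(attn[0]))
```

Each row of `attn` is a probability distribution over channel B's tokens, so every output token for channel A is a weighted mixture of channel B's values. That is what distinguishes cross attention from self-attention, where queries, keys, and values all come from the same sequence.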

Original language: English
Title of host publication: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editors: Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350368741
State: Published - 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India
Duration: 6 Apr 2025 → 11 Apr 2025

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 6/04/25 → 11/04/25

Keywords

  • convolutional cross attention
  • fusion feature
  • self-supervised audio spectrogram transformer
  • sound event localization and detection
