TY - GEN
T1 - Adapting Single-Channel Pre-trained Transformer Models for Multi-Channel Sound Event Localization and Detection
AU - He, Changjiang
AU - Cheng, Siyao
AU - Bao, Jiahua
AU - Liu, Jie
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - In recent years, the significance of pre-trained transformer audio models has been increasingly recognized. However, existing pre-trained transformer audio models are based on single-channel audio and cannot be directly applied to multi-channel audio for Sound Event Localization and Detection (SELD) tasks. To address this issue, we propose SELD-SSAST, a novel model based on the single-channel Self-Supervised Audio Spectrogram Transformer (SSAST). Specifically, we first introduce a fusion feature that enables SSAST to learn the features unique to SELD problems effectively. Second, we feed the multi-channel audio features into a single SSAST module to learn temporal information across channels through channel-mixing. Finally, to enable SSAST to learn the relationships between multi-channel audio features, we propose a Convolutional Cross Attention (CCA) module to replace the Transformer's self-attention, along with an intensity vector (IV) enhancement module to learn the differences between channel features. Our experiments show that SELD-SSAST improves performance over the baseline by 23.5% and 20.2% on two datasets, respectively. Additionally, at the same data scale, SELD-SSAST outperforms state-of-the-art (SOTA) methods on both datasets.
AB - In recent years, the significance of pre-trained transformer audio models has been increasingly recognized. However, existing pre-trained transformer audio models are based on single-channel audio and cannot be directly applied to multi-channel audio for Sound Event Localization and Detection (SELD) tasks. To address this issue, we propose SELD-SSAST, a novel model based on the single-channel Self-Supervised Audio Spectrogram Transformer (SSAST). Specifically, we first introduce a fusion feature that enables SSAST to learn the features unique to SELD problems effectively. Second, we feed the multi-channel audio features into a single SSAST module to learn temporal information across channels through channel-mixing. Finally, to enable SSAST to learn the relationships between multi-channel audio features, we propose a Convolutional Cross Attention (CCA) module to replace the Transformer's self-attention, along with an intensity vector (IV) enhancement module to learn the differences between channel features. Our experiments show that SELD-SSAST improves performance over the baseline by 23.5% and 20.2% on two datasets, respectively. Additionally, at the same data scale, SELD-SSAST outperforms state-of-the-art (SOTA) methods on both datasets.
KW - convolutional cross attention
KW - fusion feature
KW - self-supervised audio spectrogram transformer
KW - sound event localization and detection
UR - https://www.scopus.com/pages/publications/105003871175
U2 - 10.1109/ICASSP49660.2025.10887709
DO - 10.1109/ICASSP49660.2025.10887709
M3 - Conference contribution
AN - SCOPUS:105003871175
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
A2 - Rao, Bhaskar D
A2 - Trancoso, Isabel
A2 - Sharma, Gaurav
A2 - Mehta, Neelesh B.
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Y2 - 6 April 2025 through 11 April 2025
ER -