FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement

  • Shiyun Xu
  • Wenjie Zhang
  • Yinghan Cao
  • Zehua Zhang
  • Changjun He
  • Mingjiang Wang*
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal › Article › peer-review

Abstract

Noise and reverberation can significantly degrade the quality and intelligibility of speech, so multi-channel speech enhancement models that effectively leverage spatial information have attracted widespread attention. The Transformer architecture has demonstrated impressive performance in multi-channel speech enhancement. However, the redundant features extracted by the self-attention mechanism hinder the network's ability to capture local characteristics, resulting in the loss of speech details. To address this issue, we propose the fused sparse transformer (FSformer), which helps the network learn key features sparsely and effectively. We introduce the fused sparse self-attention (FSSA) module, which selects only the top-k features with the highest contribution scores when computing the self-attention map and employs a fusion strategy to adaptively retain the most valuable features. Furthermore, a local feature refinement extractor (L-FRE) and a global feature refinement extractor (G-FRE) are introduced in FSSA to enhance the interaction between global and local features. Additionally, we propose the partial gated feed-forward network (GPFN), which uses partial convolution to further enhance the network's feature extraction capability and a gating mechanism to reduce redundancy within channels, thereby compensating for the shortcomings of FSSA. Experimental results show that FSformer offers a significant advantage in speech enhancement performance, improving speech quality and intelligibility effectively and naturally and providing a pleasant listening experience. Specifically, on the spatialized DNS dataset, FSformer achieves PESQ, STOI, and SI-SDR scores of 3.40, 0.952, and 10.9, respectively. FSformer also suppresses noise and reverberation exceptionally well across a range of noise and reverberation levels. On the test set containing both noise and reverberation, FSformer achieves a PESQ score of 3.41, a STOI score of 0.959, an SI-SDR score of 10.9, a DNSMOS score of 3.525, a CD of 2.527, an LLR of 0.27, and an SNRfw of 13.434. Furthermore, FSformer demonstrates superior generalization, achieving a DNSMOS of 3.163, a P.808 MOS of 3.762, and an NISQA score of 3.779 on real datasets.
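The top-k selection described for FSSA can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract specifies neither the tensor layout nor the fusion strategy, so the single-head shapes, the function names `topk_sparse_attention` and `fused_sparse_attention`, and the softmax-weighted mix over several sparsity levels are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, top_k):
    """Single-head attention that keeps only the top_k scores per query.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v). Shapes are assumed for this sketch.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (n_q, n_k)
    # The top_k-th largest score in each row acts as that row's threshold.
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    # Scores below the threshold are masked out before the softmax.
    # (Exact ties at the threshold would keep more than top_k entries.)
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = softmax(masked, axis=-1)
    return weights @ v, weights

def fused_sparse_attention(q, k, v, top_ks, mix_logits):
    """Hypothetical fusion: softmax-weighted sum of outputs at several
    sparsity levels. In a trained model mix_logits would be learnable;
    this stand-in is an assumption, not the paper's fusion strategy."""
    mix = softmax(np.asarray(mix_logits, dtype=float))
    outs = [topk_sparse_attention(q, k, v, tk)[0] for tk in top_ks]
    return sum(w * o for w, o in zip(mix, outs))

# Usage on random data: 4 queries attending over 6 keys.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out, w = topk_sparse_attention(q, k, v, top_k=2)
fused = fused_sparse_attention(q, k, v, top_ks=[2, 4], mix_logits=[0.0, 0.0])
```

Masking before the softmax, rather than zeroing weights afterwards, keeps each row of the attention map normalized over the retained entries. Setting `top_k` equal to the number of keys recovers ordinary dense attention.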

Original language: English
Article number: 110858
Journal: Applied Acoustics
Volume: 240
State: Published - 5 Dec 2025
Externally published: Yes

Keywords

  • Gating mechanism
  • Multi-channel speech enhancement
  • Partial convolution
  • Sparse self-attention
  • Transformer
