
Two-Stage Unet with Gated-Conv Fusion for Binaural Audio Synthesis

  • Wenjie Zhang
  • Changjun He
  • Yinghan Cao
  • Shiyun Xu
  • Mingjiang Wang*
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal › Article › peer-review

Abstract

Binaural audio is crucial for creating immersive auditory experiences. However, due to the high cost and technical complexity of capturing binaural audio in real-world environments, there has been increasing interest in synthesizing binaural audio from monaural sources. In this paper, we propose a two-stage framework for binaural audio synthesis. Specifically, monaural audio is initially transformed into a preliminary binaural signal, and the shared common portion across the left and right channels, as well as the distinct differential portion in each channel, are extracted. Subsequently, the POS-ORI self-attention module (POSA) is introduced to integrate spatial information of the sound sources and capture their motion. Based on this representation, the common and differential components are separately reconstructed. The gated-convolutional fusion module (GCFM) is then employed to combine the reconstructed components and generate the final binaural audio. Experimental results demonstrate that the proposed method can accurately synthesize binaural audio and achieves state-of-the-art performance in phase estimation (Phase-$\ell_2$: 0.789, Wave-$\ell_2$: 0.147, Amplitude-$\ell_2$: 0.036).
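
The split into a shared common portion and per-channel differential portions, followed by a gated recombination, can be illustrated with a small sketch. The abstract does not define either step, so everything below is an assumption made for illustration: the common portion is taken to be the sample-wise mean of the two channels, the differential portion is the residual of each channel, and the gate is a fixed per-sample sigmoid. The function names are hypothetical. The GCFM described in the paper is a learned convolutional module, which this toy code does not reproduce.

```python
import math


def split_common_differential(left, right):
    """Split a stereo pair into a shared part and per-channel residuals.

    Assumption (not from the paper): the common part is the sample-wise
    mean of the two channels, and each differential part is what remains.
    """
    common = [(l + r) / 2.0 for l, r in zip(left, right)]
    diff_left = [l - c for l, c in zip(left, common)]
    diff_right = [r - c for r, c in zip(right, common)]
    return common, diff_left, diff_right


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(common, diff, gate_logits):
    """Recombine one channel as common + gate * differential.

    Assumption (not from the paper): the gate is a per-sample sigmoid of a
    given logit. In a learned model the logits would come from a
    convolution over features rather than being supplied by hand.
    """
    return [c + sigmoid(g) * d for c, d, g in zip(common, diff, gate_logits)]


if __name__ == "__main__":
    left = [0.5, -0.2, 0.1]
    right = [0.3, 0.4, -0.1]
    common, d_left, d_right = split_common_differential(left, right)

    # A gate that is almost fully open recovers the original channel.
    open_gate = [50.0] * len(left)
    print(gated_fusion(common, d_left, open_gate))
    print(gated_fusion(common, d_right, open_gate))
```

Under the mean-based assumption the two differential parts are exact negatives of each other, so a gate near 1 restores each original channel and a gate near 0 collapses both channels to the common signal. A learned gate would sit between these extremes and vary over time and frequency.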

Original language: English
Article number: 1790
Journal: Sensors
Volume: 25
Issue number: 6
State: Published - Mar 2025
Externally published: Yes

Keywords

  • UNet
  • binaural audio synthesis
  • self-attention
  • spatial perception
