Abstract
Binaural audio is crucial for creating immersive auditory experiences. However, because capturing binaural audio in real-world environments is costly and technically complex, there has been increasing interest in synthesizing binaural audio from monaural sources. In this paper, we propose a two-stage framework for binaural audio synthesis. Specifically, monaural audio is first transformed into a preliminary binaural signal, from which the common portion shared by the left and right channels and the differential portion distinct to each channel are extracted. Subsequently, the POS-ORI self-attention module (POSA) is introduced to integrate the spatial information of the sound sources and capture their motion. Based on this representation, the common and differential components are reconstructed separately. The gated-convolutional fusion module (GCFM) then combines the reconstructed components to generate the final binaural audio. Experimental results demonstrate that the proposed method accurately synthesizes binaural audio and achieves state-of-the-art performance in phase estimation (Phase-$\ell_2$: 0.789, Wave-$\ell_2$: 0.147, Amplitude-$\ell_2$: 0.036).
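The abstract does not state how the common and differential portions are defined. As a minimal sketch only, assuming the common portion is the per-sample mean of the two channels and each differential portion is that channel's residual from the mean (a mid/side-style split, not necessarily the paper's definition), the decomposition and its exact inverse could look as follows. The function names `split_common_differential` and `recombine` are hypothetical.

```python
import math

def split_common_differential(left, right):
    """Split a two-channel signal into a shared part and per-channel residuals.

    Assumption for illustration: common = mean of both channels.
    The paper may define these components differently.
    """
    common = [(l + r) / 2.0 for l, r in zip(left, right)]
    diff_left = [l - c for l, c in zip(left, common)]
    diff_right = [r - c for r, c in zip(right, common)]
    return common, diff_left, diff_right

def recombine(common, diff_left, diff_right):
    """Invert the split: each channel is the common part plus its residual."""
    left = [c + d for c, d in zip(common, diff_left)]
    right = [c + d for c, d in zip(common, diff_right)]
    return left, right

# Toy preliminary binaural signal: the right channel is quieter and delayed in phase.
sample_rate = 48000
times = [n / sample_rate for n in range(480)]
left = [math.sin(2 * math.pi * 440 * t) for t in times]
right = [0.6 * math.sin(2 * math.pi * 440 * t - 0.3) for t in times]

common, diff_left, diff_right = split_common_differential(left, right)
restored_left, restored_right = recombine(common, diff_left, diff_right)
```

Under this mean-based assumption the two differential signals are exact negatives of each other, so the split adds no information and loses none. It only separates what the channels share from what distinguishes them, so that each part can be reconstructed on its own before fusion.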
| Original language | English |
|---|---|
| Article number | 1790 |
| Journal | Sensors |
| Volume | 25 |
| Issue number | 6 |
| State | Published - Mar 2025 |
| Externally published | Yes |
Keywords
- UNet
- binaural audio synthesis
- self-attention
- spatial perception
Fingerprint
Dive into the research topics of 'Two-Stage Unet with Gated-Conv Fusion for Binaural Audio Synthesis'. Together they form a unique fingerprint.