AudioSpa: Spatializing Sound Events with Text

📅 2025-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the novel task of text-guided binaural spatial audio generation—synthesizing binaural audio with controllable spatial localization from a monaural reference and a textual description (e.g., “birdsong on the left”). The authors propose AudioSpa, an end-to-end framework for this task. Methodologically, it combines large language model–based processing of acoustic and textual information, fusion multi-head attention (FMHA) to integrate text tokens into the acoustic stream, a binaural source localization network for evaluating the generated audio, and a data augmentation strategy covering diverse spatial positions. Notably, the model explicitly maps text instructions to spatial directions and jointly targets localization accuracy and binaural signal fidelity. Experiments show that it places sound events at the specified locations accurately and achieves competitive performance in both localization accuracy and signal distortion.

📝 Abstract
Text-to-audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating a sense of spatiality, has not been explored yet. In this work, we introduce text-guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is additionally given. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single-source sound event datasets. To address this, we propose AudioSpa, an end-to-end model that applies large language models to process both acoustic and textual information. We employ fusion multi-head attention (FMHA) to integrate text tokens, which enhances the generation capability of multimodal learning. Additionally, we propose a binaural source localization model to assess the quality of the generated audio. Finally, we design a data augmentation strategy to generate diverse datasets, which enables the model to spatialize sound events across various spatial positions. Experimental results demonstrate that our model is able to place sounds at the specified locations accurately. It achieves competitive performance in both localization accuracy and signal distortion. Our demonstrations are available at https://linfeng-feng.github.io/AudioSpa-demo.
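The abstract names fusion multi-head attention (FMHA) as the mechanism that injects text tokens into the acoustic stream, but does not spell out its formulation. Below is a minimal NumPy sketch of one plausible reading—cross-attention in which audio frames query the text tokens, with random matrices standing in for learned projections. The function name, weight names, and residual fusion are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fusion_multihead_attention(audio, text, n_heads=4):
    """Audio frames (queries) attend to text tokens (keys/values); the
    attended text context is then added back to the audio stream.
    Shapes: audio (T, d), text (L, d); d must divide by n_heads."""
    T, d = audio.shape
    L, _ = text.shape
    dh = d // n_heads
    # Random projections stand in for learned weights in this sketch.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q = (audio @ Wq).reshape(T, n_heads, dh).transpose(1, 0, 2)  # (H, T, dh)
    k = (text @ Wk).reshape(L, n_heads, dh).transpose(1, 0, 2)   # (H, L, dh)
    v = (text @ Wv).reshape(L, n_heads, dh).transpose(1, 0, 2)   # (H, L, dh)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))       # (H, T, L)
    ctx = (attn @ v).transpose(1, 0, 2).reshape(T, d)            # (T, d)
    return audio + ctx @ Wo  # residual fusion of text context into audio
```

The residual form keeps the acoustic pathway intact while letting text tokens modulate each audio frame, which is one common way to fuse a short conditioning sequence into a longer feature stream.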
Problem

Research questions and friction points this paper is trying to address.

Generating binaural spatial audio from text
Associating sound events with directions
Enhancing multimodal learning with fusion multi-head attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided binaural audio generation
Fusion multi-head attention integration
Binaural source localization model
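The paper's localization model is a learned network used to score the generated audio; its internals are not given here. As a point of reference, azimuth can already be recovered from classic interaural cues. The toy sketch below pairs a sine-law panning renderer with an interaural-level-difference (ILD) estimator—both are illustrative baselines under simplifying assumptions (free field, single source), not the paper's method:

```python
import numpy as np

def render_binaural(mono, azimuth_deg, sr=16000, max_itd_s=6.6e-4):
    """Toy binaural renderer: apply an interaural time difference (ITD)
    and a sine-law level difference for a source at azimuth_deg
    (-90 = hard left, +90 = hard right). Real systems convolve with HRTFs."""
    az = np.deg2rad(azimuth_deg)
    itd = int(round(max_itd_s * np.sin(az) * sr))  # samples of interaural lag
    g_r = np.sqrt((1 + np.sin(az)) / 2)  # sine-law pan gains
    g_l = np.sqrt((1 - np.sin(az)) / 2)
    left = np.roll(mono, max(itd, 0)) * g_l    # delay left ear if source is right
    right = np.roll(mono, max(-itd, 0)) * g_r  # delay right ear if source is left
    return np.stack([left, right])

def estimate_azimuth(binaural):
    """Estimate azimuth from the interaural level difference by inverting
    the sine-law pan: sin(az) = (E_right - E_left) / (E_right + E_left)."""
    left, right = binaural
    el, er = np.sum(left ** 2), np.sum(right ** 2)
    return np.rad2deg(np.arcsin(np.clip((er - el) / (er + el), -1.0, 1.0)))
```

Because the renderer's channel energies follow the sine law exactly, the ILD estimator inverts it almost perfectly here; a learned localizer earns its keep once reverberation, HRTF coloration, and multiple sources break these closed-form assumptions.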
Linfeng Feng
Northwestern Polytechnical University
Speech Processing, Multimodal Learning

Lei Zhao
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China; Institute of Artificial Intelligence (TeleAI), China Telecom, Beijing 100033, China; Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518063, China

Boyu Zhu
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China; Institute of Artificial Intelligence (TeleAI), China Telecom, Beijing 100033, China; Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518063, China

Xiao-Lei Zhang
Professor, Northwestern Polytechnical University, China
Speech Processing, Machine Learning, Signal Processing

Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom, Beijing 100033, China