DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model

📅 2025-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the open problem of text-to-spatial-audio generation by introducing the first end-to-end framework for text-driven 3D auditory content synthesis. Methodologically, the authors propose a dual-spectrogram-guided diffusion architecture that jointly optimizes audio fidelity via Mel spectrograms and spatial localization accuracy via STFT spectrograms; it incorporates a VAE latent-space representation, large-language-model-based text encoding, and a dual-conditioned diffusion process. Key contributions include: (1) SpatialAudio-1K, the first publicly available text–spatial-audio dataset with azimuth annotations; (2) novel spatial-aware evaluation metrics; and (3) state-of-the-art performance, significantly outperforming baselines in event consistency and azimuth error, enabling high-fidelity binaural or stereo audio generation directly from text with precise horizontal-angle localization.
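As an illustration of the two feature views named in the summary, the following is a minimal sketch of extracting a Mel spectrogram (fidelity-oriented) and a complex STFT spectrogram (azimuth-oriented) from a stereo waveform. It assumes torchaudio, and the parameter values (sample rate, n_fft, hop length, Mel bands) are illustrative choices rather than the paper's settings.

```python
# Minimal sketch: the two acoustic feature views used as guidance signals.
# Parameter values are assumptions for illustration, not the paper's settings.
import torch
import torchaudio

SAMPLE_RATE = 16000
N_FFT, HOP, N_MELS = 1024, 256, 64

# Mel spectrogram: compact, perceptually motivated view, used for audio fidelity.
mel_tf = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
)
# Complex STFT: keeps inter-channel phase, which carries the spatial cues.
stft_tf = torchaudio.transforms.Spectrogram(
    n_fft=N_FFT, hop_length=HOP, power=None
)

waveform = torch.randn(2, SAMPLE_RATE * 4)   # placeholder 4-second stereo clip

mel_spec = mel_tf(waveform)                   # (2, n_mels, frames), magnitude
stft_spec = stft_tf(waveform)                 # (2, freq_bins, frames), complex

# The inter-channel phase difference is one spatial cue the STFT view
# preserves but the Mel view discards.
ipd = torch.angle(stft_spec[0]) - torch.angle(stft_spec[1])
```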

📝 Abstract
Text-to-audio (TTA), which generates audio signals from textual descriptions, has received considerable attention in recent years. However, recent works have focused only on text-to-monaural-audio generation. Spatial audio provides a more immersive auditory experience than monaural audio, e.g., in virtual reality. To address this gap, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. In particular, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features: the Mel spectrogram, which is good for improving synthesis quality, and the short-time Fourier transform (STFT) spectrogram, which is good for improving azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts for training the VAEs and the diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
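To make the training recipe described in the abstract concrete, below is a minimal sketch of one dual-conditioned latent-diffusion training step. The module names (mel_vae, stft_vae, text_encoder, denoiser), the latent concatenation, and the cosine noise schedule are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of one training step: encode both spectrogram views into latents,
# encode the prompt, and train a denoiser with a standard noise-prediction loss.
# All modules here are hypothetical placeholders.
import torch
import torch.nn.functional as F

def training_step(batch, mel_vae, stft_vae, text_encoder, denoiser, num_steps=1000):
    # 1) Encode both spectrogram views into latent acoustic representations.
    z_mel = mel_vae.encode(batch["mel_spectrogram"])     # fidelity-oriented latent
    z_stft = stft_vae.encode(batch["stft_spectrogram"])  # azimuth-oriented latent
    z = torch.cat([z_mel, z_stft], dim=1)                # joint latent (assumed layout)

    # 2) Encode the text prompt (sound event + direction) with a frozen LLM encoder.
    with torch.no_grad():
        text_emb = text_encoder(batch["prompt"])

    # 3) Denoising-diffusion objective: add noise at a random timestep and
    #    predict it, conditioned on the text features (simple cosine schedule).
    t = torch.randint(0, num_steps, (z.shape[0],), device=z.device)
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2) ** 2
    alpha_bar = alpha_bar.view(-1, *([1] * (z.dim() - 1)))
    noise = torch.randn_like(z)
    z_noisy = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * noise

    pred_noise = denoiser(z_noisy, t, text_emb)
    return F.mse_loss(pred_noise, noise)
```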
Problem

Research questions and friction points this paper is trying to address.

How to generate spatial audio directly from text descriptions, given that prior text-to-audio work targets monaural output only
How to improve synthesis quality and azimuth accuracy of spatial sound events simultaneously
Lack of spatial-aware evaluation metrics for quantifying azimuth error in generated audio (an illustrative metric sketch follows this list)
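The paper's exact metric definitions are not reproduced here. As a hedged illustration of a spatial-aware metric, the sketch below estimates the azimuth of a stereo clip from its inter-channel time difference and averages the absolute azimuth error over generated/reference pairs; the microphone spacing, speed of sound, and the simple cross-correlation estimator are assumptions for illustration only.

```python
# Illustrative spatial-aware metric: mean absolute azimuth error.
# This is a hypothetical formulation, not the paper's published metric.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.18       # assumed inter-ear / inter-mic distance in metres

def estimate_azimuth(left, right, sr):
    """Estimate the horizontal angle (degrees) from the inter-channel time difference."""
    # Cross-correlate the two channels and take the lag with maximum similarity.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)          # lag in samples
    itd = lag / sr                                    # lag in seconds
    # Clamp the arcsin argument to the valid range before converting to an angle.
    arg = np.clip(SPEED_OF_SOUND * itd / MIC_SPACING, -1.0, 1.0)
    return np.degrees(np.arcsin(arg))

def mean_azimuth_error(generated, references, sr):
    """Average absolute azimuth error over pairs of (2, T) stereo signals."""
    errors = []
    for gen, ref in zip(generated, references):
        az_gen = estimate_azimuth(gen[0], gen[1], sr)
        az_ref = estimate_azimuth(ref[0], ref[1], sr)
        errors.append(abs(az_gen - az_ref))
    return float(np.mean(errors))
```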
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-spectrogram guided diffusion model
Variational autoencoders for acoustic representation
Large language model encoder for text features (an end-to-end inference sketch follows this list)
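To illustrate how the three components above could fit together at inference time, here is a minimal text-only sampling sketch using a plain DDPM-style ancestral sampler. The components (text_encoder, denoiser, vae, vocoder), the latent shape, and the noise schedule are hypothetical placeholders rather than the released DualSpec implementation.

```python
# Sketch of text-only inference: encode the prompt, sample a latent with a
# standard DDPM loop, then decode to a two-channel spectrogram and waveform.
# All components and shapes are illustrative assumptions.
import torch

@torch.no_grad()
def generate_spatial_audio(prompt, text_encoder, denoiser, vae, vocoder,
                           latent_shape=(1, 8, 64, 256), num_steps=1000):
    text_emb = text_encoder([prompt])     # e.g. "a dog barking at 90 degrees"

    # Linear beta schedule and its cumulative products (standard DDPM setup).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape)         # start from Gaussian noise in latent space
    for step in reversed(range(num_steps)):
        t = torch.full((latent_shape[0],), step, dtype=torch.long)
        eps = denoiser(z, t, text_emb)
        # Posterior mean of p(z_{t-1} | z_t) given the predicted noise.
        z = (z - betas[step] / torch.sqrt(1.0 - alpha_bars[step]) * eps) \
            / torch.sqrt(alphas[step])
        if step > 0:
            z = z + torch.sqrt(betas[step]) * torch.randn_like(z)

    spectrogram = vae.decode(z)           # latent -> two-channel spectrogram
    return vocoder(spectrogram)           # spectrogram -> binaural waveform
```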
👥 Authors
Lei Zhao
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China; Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China; Research and Development Institute of Northwestern Polytechnical University in Shenzhen, China
Sizhou Chen
College of Artificial Intelligence, Chengdu University of Information Technology, Chengdu, Sichuan 610225, China; Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China
Linfeng Feng
Northwestern Polytechnical University
Speech Processing, Multimodal Learning
Xiao-Lei Zhang
Professor, Northwestern Polytechnical University, China
Speech Processing, Machine Learning, Signal Processing
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China