FoleySpace: Vision-Aligned Binaural Spatial Audio Generation

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) methods predominantly generate monaural audio, lacking spatial awareness and thus failing to support immersive binaural synthesis. This work introduces FoleySpace, a framework that couples vision-estimated sound source trajectories with a pre-trained monaural V2A model to produce spatially aligned binaural audio. It comprises three key components: (i) per-frame estimation of the sound source's 2D coordinates and depth from video, (ii) a coordinate mapping mechanism that converts these estimates into a 3D trajectory, and (iii) a diffusion model conditioned on the trajectory and the V2A-generated monaural audio. Rather than relying on captured binaural recordings, training data are constructed from recorded Head-Related Impulse Responses covering various sound source movement scenarios. Quantitative and perceptual evaluations demonstrate improvements over existing V2A methods in sound source localization consistency and binaural cue fidelity (e.g., interaural time/level differences), substantially enhancing audiovisual immersion.
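To make the binaural cues mentioned above concrete, the Python sketch below estimates the interaural time difference (ITD) from a cross-correlation peak and the interaural level difference (ILD) as an RMS ratio; the helper name `itd_ild` and the 16 kHz sample rate are assumptions for illustration, not the paper's evaluation code.

```python
# Minimal sketch (not the paper's evaluation code): estimate interaural
# time/level differences from a binaural clip.
import numpy as np

def itd_ild(left, right, sr):
    """left, right: 1-D float arrays (two ear signals); sr: sample rate in Hz."""
    eps = 1e-12
    # ILD: RMS level ratio between the two ears, in dB.
    ild_db = 20 * np.log10((np.sqrt(np.mean(left ** 2)) + eps)
                           / (np.sqrt(np.mean(right ** 2)) + eps))
    # ITD: lag of the cross-correlation peak; positive means the left
    # channel lags the right (source toward the right ear).
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    return lag / sr, ild_db

# Example: delay the left channel by 20 samples -> ITD = +20 / sr seconds.
sr = 16000
sig = np.random.randn(sr)
itd, ild = itd_ild(np.pad(sig, (20, 0)), np.pad(sig, (0, 20)), sr)
print(f"ITD = {itd * 1e3:.2f} ms, ILD = {ild:.2f} dB")
```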

📝 Abstract
Recently, with the advancement of AIGC, deep learning-based video-to-audio (V2A) technology has garnered significant attention. However, existing research mostly focuses on mono audio generation that lacks spatial perception, while the exploration of binaural spatial audio generation, which can provide a stronger sense of immersion, remains insufficient. To address this problem, we propose FoleySpace, a framework for video-to-binaural audio generation that produces immersive and spatially consistent stereo sound guided by visual information. Specifically, we develop a sound source estimation method to determine the sound source's 2D coordinates and depth in each video frame, and then employ a coordinate mapping mechanism to convert the 2D source positions into a 3D trajectory. This 3D trajectory, together with the monaural audio generated by a pre-trained V2A model, serves as a conditioning input for a diffusion model to generate spatially consistent binaural audio. To support the generation of dynamic sound fields, we construct a training dataset based on recorded Head-Related Impulse Responses that covers various sound source movement scenarios. Experimental results demonstrate that the proposed method outperforms existing approaches in spatial perception consistency, effectively enhancing the immersive quality of the audio-visual experience.
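The 2D-to-3D step in the abstract can be illustrated with a standard pinhole back-projection; the paper's exact coordinate mapping is not given here, so the camera intrinsics and the listener-centric axes below are assumptions for the sketch.

```python
# Minimal sketch (assumed pinhole model, not the paper's exact mapping):
# lift per-frame 2D source coordinates plus depth into a 3D trajectory.
import numpy as np

def pixels_to_trajectory(uv, depth, fx, fy, cx, cy):
    """uv: (T, 2) pixel coordinates per frame; depth: (T,) metric depth.
    fx, fy, cx, cy: camera intrinsics (assumed known or calibrated)."""
    u, v = uv[:, 0], uv[:, 1]
    x = (u - cx) / fx * depth  # lateral offset (right positive)
    y = (v - cy) / fy * depth  # vertical offset (down positive)
    z = depth                  # distance along the camera axis
    return np.stack([x, y, z], axis=1)  # (T, 3) source trajectory

# Example: a source crossing the frame left to right at 2 m depth.
uv = np.array([[100.0, 240.0], [220.0, 240.0], [420.0, 240.0], [540.0, 240.0]])
traj = pixels_to_trajectory(uv, np.full(4, 2.0),
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(traj)  # x sweeps from negative (left) to positive (right)
```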
Problem

Research questions and friction points this paper is trying to address.

Existing V2A methods generate only mono audio, lacking spatial perception
Sound source positions must be estimated from video for spatially accurate rendering
Dynamic sound fields with moving sources are needed for an immersive audio-visual experience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-aligned binaural audio generation framework
Sound source estimation with 2D coordinates and depth
Diffusion model conditioned on the trajectory and mono audio for spatially consistent binaural output (see the sketch below)
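As a rough illustration of that conditioning scheme, the PyTorch sketch below shows one plausible way a denoiser could ingest the mono audio and the 3D trajectory; the class name `TrajectoryConditionedDenoiser`, the mel feature sizes, and the GRU backbone are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch, not the paper's architecture: a diffusion denoiser
# that predicts noise on a 2-channel (binaural) mel representation,
# conditioned on the V2A mono audio and the per-frame 3D trajectory.
import torch
import torch.nn as nn

class TrajectoryConditionedDenoiser(nn.Module):
    def __init__(self, n_mels=80, traj_dim=3, hidden=256):
        super().__init__()
        self.in_proj = nn.Linear(2 * n_mels, hidden)    # noisy binaural input
        self.mono_proj = nn.Linear(n_mels, hidden)      # mono conditioning
        self.traj_proj = nn.Linear(traj_dim, hidden)    # 3D source position
        self.backbone = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(hidden, 2 * n_mels)   # predicted noise

    def forward(self, noisy_binaural, mono, traj, t_emb):
        # noisy_binaural: (B, T, 2*n_mels); mono: (B, T, n_mels);
        # traj: (B, T, 3); t_emb: (B, hidden) diffusion-step embedding.
        h = (self.in_proj(noisy_binaural) + self.mono_proj(mono)
             + self.traj_proj(traj) + t_emb.unsqueeze(1))
        h, _ = self.backbone(h)
        return self.out_proj(h)

# Shape check with random tensors (batch 2, 100 frames).
model = TrajectoryConditionedDenoiser()
eps_hat = model(torch.randn(2, 100, 160), torch.randn(2, 100, 80),
                torch.randn(2, 100, 3), torch.randn(2, 256))
print(eps_hat.shape)  # torch.Size([2, 100, 160])
```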
👥 Authors
Lei Zhao
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China, also with the Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China, and also with the Research and Development Institute of Northwestern Polytechnical University in Shenzhen, China
Rujin Chen
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China, also with the Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China, and also with the Research and Development Institute of Northwestern Polytechnical University in Shenzhen, China
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China
Xiao-Lei Zhang
Professor, Northwestern Polytechnical University, China
Speech Processing · Machine Learning · Signal Processing
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China