ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

📅 2025-12-02
🤖 AI Summary
Existing video-to-audio methods predominantly generate monaural audio, lacking spatial immersion. Binaural audio generation typically relies on a two-stage pipeline (monaural synthesis followed by spatialization), which leads to error accumulation and compromised spatio-temporal consistency. This work introduces, for the first time, an end-to-end video-driven binaural audio generation task. The authors construct the large-scale BiAudio dataset and propose a dual-branch latent architecture with a conditional spacetime module to jointly model the left and right channels. Leveraging conditional flow matching in latent space, the method explicitly models viewpoint dynamics, sound-source motion, and complex acoustic environments. Experiments demonstrate state-of-the-art performance on both objective metrics and subjective listening evaluations, significantly improving audio fidelity and spatial immersion.

📝 Abstract
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
Problem

Research questions and friction points this paper addresses:

- Generating binaural audio directly from video, end-to-end
- Overcoming the error accumulation and spatio-temporal misalignment of two-stage pipelines
- Ensuring audio-video alignment under sound-source motion and varied acoustic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out:

- End-to-end binaural audio generation from silent video
- Dual-branch architecture with conditional flow matching
- Balances inter-channel consistency with distinctive spatial characteristics
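To make the core training objective concrete: conditional flow matching learns a velocity field that transports noise to data latents along a simple interpolation path. The sketch below is an illustrative toy, not the authors' implementation; the latent dimension, the use of one latent per channel, and all variable names are assumptions. In ViSAudio the predicted velocity would additionally be conditioned on video features, which are omitted here.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-interpolant conditional flow matching targets.

    x0: noise sample, x1: data latent, t: scalar time in [0, 1].
    Returns the point on the path x_t and the target velocity u_t
    that a conditional velocity model would be trained to regress.
    """
    x_t = (1.0 - t) * x0 + t * x1   # interpolated latent at time t
    u_t = x1 - x0                   # constant target velocity along the path
    return x_t, u_t

rng = np.random.default_rng(0)
latent_dim = 8                       # hypothetical latent size
# A dual-branch model would predict one velocity per channel (left, right).
x0 = rng.standard_normal((2, latent_dim))   # noise for both channels
x1 = rng.standard_normal((2, latent_dim))   # target audio latents
x_t, u_t = cfm_training_pair(x0, x1, t=0.5)
```

A training step would then minimize the mean-squared error between the model's predicted velocity at `(x_t, t)` and `u_t`; at inference, integrating the learned velocity field from t = 0 to 1 turns noise into a binaural latent pair.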
Authors

- Mengchen Zhang, Zhejiang University (computer vision, AIGC)
- Qi Chen, Shanghai Jiao Tong University, Shanghai Innovation Institute
- Tong Wu, Stanford University
- Zihan Liu, Beihang University, Shanghai Artificial Intelligence Laboratory
- Dahua Lin, The Chinese University of Hong Kong (computer vision, machine learning, probabilistic inference, Bayesian nonparametrics)