SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of generating long-duration, high-fidelity, tightly synchronized audio from silent video. SALSA-V combines a masked diffusion framework that can synthesize audio of arbitrary length, a shortcut loss that enables high-quality generation in as few as eight sampling steps, and stochastic masking during training that, together with cross-modal conditioning on the video, lets the model align audio with the visuals and match the spectral characteristics of reference audio without dedicated fine-tuning or retraining. This makes the method suitable for professional applications such as Foley synthesis and sound design. Quantitative evaluations covering synchronization error, STFT reconstruction quality (e.g., L1 loss on log-magnitude spectrograms), and perceptual metrics show consistent gains over state-of-the-art methods, and subjective listening tests confirm improved naturalness and synchrony. The few-step sampler keeps inference efficient enough for near-real-time deployment.
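
As a concrete illustration of the shortcut idea, the sketch below shows one plausible form of a shortcut-style self-consistency loss under a flow-matching parameterization in which the velocity network also receives the step size as input; the function signature, tensor shapes, and weighting are assumptions for illustration and not the published SALSA-V formulation.

```python
# Hypothetical shortcut-style self-consistency loss (assumed interface:
# model(x, t, d, cond) predicts a velocity for a step of size d; x has
# shape (batch, channels, frames), t and d have shape (batch,)).
import torch
import torch.nn.functional as F


def shortcut_consistency_loss(model, x_t, t, d, cond):
    """One step of size 2*d should match the composition of two steps of size d."""
    with torch.no_grad():
        v1 = model(x_t, t, d, cond)                # velocity for the first small step
        x_mid = x_t + d.view(-1, 1, 1) * v1        # advance the latent by d
        v2 = model(x_mid, t + d, d, cond)          # velocity for the second small step
        v_target = 0.5 * (v1 + v2)                 # average velocity across the 2*d span
    v_big = model(x_t, t, 2 * d, cond)             # single large step of size 2*d
    return F.mse_loss(v_big, v_target)
```

In practice a term like this would typically be trained alongside a standard denoising or flow-matching loss, so that small-step predictions stay anchored to the data while larger steps learn to match their composition.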

📝 Abstract
We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
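
To make the masked diffusion objective and the random-masking behaviour easier to picture, here is a minimal sketch of a stochastic frame-masking policy in which some latent audio frames are revealed as clean reference conditioning while the rest must be generated; the shapes, keep probability, and contiguous-chunk policy are illustrative assumptions rather than the published training configuration.

```python
# Hypothetical stochastic masking over latent audio frames for
# audio-conditioned masked diffusion training (illustrative policy only).
import torch


def random_frame_mask(batch: int, frames: int, p_ref: float = 0.5,
                      device: str = "cpu") -> torch.Tensor:
    """Boolean mask of shape (batch, frames): True = frame kept as clean
    reference audio, False = frame to be generated by the diffusion model."""
    mask = torch.zeros(batch, frames, dtype=torch.bool, device=device)
    for b in range(batch):
        # With probability p_ref, reveal a random contiguous chunk as reference
        # audio; otherwise the whole clip is generated from video alone.
        if frames > 1 and torch.rand(()).item() < p_ref:
            length = torch.randint(1, frames, (1,)).item()
            start = torch.randint(0, frames - length + 1, (1,)).item()
            mask[b, start:start + length] = True
    return mask
```

Under such a scheme, only the frames marked for generation would be noised and scored by the diffusion loss, while the revealed frames supply the spectral character the model learns to match; the same interface would then allow conditioning on a reference recording, or on previously generated audio, when stitching together sequences of unconstrained length at inference time.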
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized long-form audio from silent videos
Achieving rapid, high-quality audio synthesis in only a few sampling steps
Improving audio-video alignment and supporting professional applications such as Foley and sound design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal video-to-audio generation with masked diffusion
Shortcut loss enables rapid, high-quality generation in as few as eight sampling steps (see the sampling sketch after this list)
Random masking during training lets the model match the spectral characteristics of reference audio
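
The few-step claim is easiest to see as a sampling loop. The sketch below shows a hypothetical eight-step sampler for a velocity model that takes the step size as an extra input, matching the earlier loss sketch; the model signature and uniform time grid are assumptions, and decoding the resulting latents to a waveform (for example with a codec decoder or vocoder) is a separate step not shown.

```python
# Hypothetical eight-step sampler for a velocity model with a step-size input
# (illustrative interface, not the SALSA-V API).
import torch


@torch.no_grad()
def sample_audio_latents(model, video_cond, shape, steps: int = 8,
                         device: str = "cpu") -> torch.Tensor:
    """Integrate from Gaussian noise to audio latents in `steps` equal steps."""
    x = torch.randn(shape, device=device)                      # start from pure noise
    batch = shape[0]
    d = torch.full((batch,), 1.0 / steps, device=device)       # constant step size
    for i in range(steps):
        t = torch.full((batch,), i / steps, device=device)     # current time in [0, 1)
        v = model(x, t, d, video_cond)                         # predicted velocity
        x = x + v * d.view(-1, *([1] * (x.dim() - 1)))         # Euler update
    return x
```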