🤖 AI Summary
This work addresses a limitation of existing audio generation models, which are typically confined to single tasks and struggle to jointly synthesize the complete audio track (speech, sound effects, and music) that a video requires. The paper proposes Foley-Omni, the first unified multimodal audio generation framework that jointly models these three audio types through a shared latent space, integrating audio-visual alignment techniques and a novel data curation pipeline to enable end-to-end video-driven soundtrack generation. Key contributions include a unified generative architecture, an accompanying data processing pipeline, and the release of V2ST-Bench, a new evaluation benchmark. Experiments demonstrate that the proposed method matches or approaches the performance of specialized models on individual tasks while significantly improving intelligibility, audiovisual consistency, and subjective audio quality in composite soundtracks.
📝 Abstract
Recent unified audio generation models support diverse tasks across speech, sound effects, and music, but most still focus on isolated task-level synthesis. Real video production, however, often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic evaluation of video soundtrack generation. Experiments show that Foley-Omni performs competitively with expert systems on individual synthesis tasks while improving speech intelligibility, audiovisual consistency, and perceptual quality for mixed soundtrack generation.