Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing unified audio generation models: although they support tasks across speech, sound effects, and music, most still perform isolated task-level synthesis and do not jointly generate the complete audio track (speech, sound effects, and music) that a video requires. The paper proposes Foley-Omni, a unified multimodal audio generation model that jointly models these three audio types within a shared latent generation process to enable video soundtrack generation. Key contributions include the unified generative model, an audiovisual data curation pipeline, and V2ST-Bench, a new benchmark for holistic video soundtrack evaluation. Experiments show that the method is competitive with expert systems on individual synthesis tasks while improving speech intelligibility, audiovisual consistency, and perceptual quality for mixed soundtracks.

📝 Abstract

Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.

Problem

Research questions and friction points this paper is trying to address.

unified audio generation

video soundtrack generation

multimodal generation

audiovisual consistency

task-level synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multimodal generation

complete video soundtrack generation

joint audio modeling