Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control

📅 2024-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio methods suffer from inadequate control over loudness dynamics and weak multi-modal co-modeling, resulting in low audio naturalness and poor compatibility with professional Foley production. This paper proposes the first fine-grained video-to-audio generation framework to incorporate a LUFS (Loudness Units relative to Full Scale) embedding, conditioned jointly on textual, auditory, and pixel-level visual modalities. Leveraging a diffusion-based architecture, it generates variable-length (up to 60 seconds), 44.1 kHz high-fidelity stereo audio via multi-modal feature alignment, cross-modal attention fusion, and pixel-level visual encoding. Crucially, LUFS is introduced as a time-varying loudness control signal embedded directly into the generative process, enabling channel-wise loudness accuracy and dynamic expressiveness. Experiments show that the method outperforms state-of-the-art approaches in audio fidelity, stereo channel support, temporal flexibility, and loudness controllability.
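As a concrete illustration of the time-varying LUFS control signal described above, the sketch below computes a per-channel loudness curve from a stereo clip. This is a minimal sketch, not the paper's implementation: it assumes the third-party pyloudnorm package, and the window and hop sizes are illustrative choices.

```python
# Minimal sketch (not the paper's code) of measuring a time-varying, per-channel
# LUFS curve from a stereo clip, which could then serve as a loudness control signal.
# Assumes the third-party `pyloudnorm` package; window/hop sizes are illustrative.
import numpy as np
import pyloudnorm as pyln

def lufs_curve(audio: np.ndarray, sr: int, win_s: float = 1.0, hop_s: float = 0.5):
    """audio: (num_samples, 2) float array in [-1, 1]; returns (num_frames, 2) LUFS."""
    meter = pyln.Meter(sr)                      # ITU-R BS.1770 K-weighted loudness meter
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = []
    for start in range(0, audio.shape[0] - win + 1, hop):
        chunk = audio[start:start + win]
        # Measure each channel separately so the curve can drive
        # channel-wise loudness control downstream.
        frames.append([meter.integrated_loudness(chunk[:, c]) for c in range(2)])
    curve = np.asarray(frames)
    return np.clip(curve, -70.0, 0.0)           # clamp -inf from silent windows

# Example: a 60 s, 44.1 kHz stereo clip yields roughly 119 frames of (L, R) loudness.
sr = 44100
audio = np.random.uniform(-0.1, 0.1, size=(60 * sr, 2)).astype(np.float32)
print(lufs_curve(audio, sr).shape)
```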

📝 Abstract
Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.
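To make the LUFS-embedding idea from the abstract more tangible, the following hypothetical sketch maps a per-channel loudness curve into a conditioning sequence that a diffusion backbone could attend to alongside text and video features. The module, dimensions, and normalization are assumptions for illustration only, not the released Tri-Ergon architecture.

```python
# Hypothetical sketch (not the released Tri-Ergon code) of turning a per-channel
# LUFS curve into a conditioning sequence for a diffusion backbone. Module and
# dimension names are assumptions made for illustration.
import torch
import torch.nn as nn

class LUFSEmbedding(nn.Module):
    def __init__(self, num_channels: int = 2, dim: int = 256):
        super().__init__()
        # Small MLP lifting the (left, right) loudness values of each frame
        # into the conditioning dimension used by the generator.
        self.proj = nn.Sequential(
            nn.Linear(num_channels, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, lufs: torch.Tensor) -> torch.Tensor:
        # lufs: (batch, frames, channels), values roughly in [-70, 0] LUFS.
        # Normalize to [0, 1] so the embedding is insensitive to the dB scale.
        lufs = (lufs.clamp(-70.0, 0.0) + 70.0) / 70.0
        return self.proj(lufs)                  # (batch, frames, dim)

# Usage: the resulting sequence could be concatenated with text/video features
# and consumed via cross-modal attention during diffusion denoising.
emb = LUFSEmbedding()
curve = torch.full((1, 119, 2), -23.0)          # a flat -23 LUFS target curve
print(emb(curve).shape)                         # torch.Size([1, 119, 256])
```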
Problem

Research questions and friction points this paper is trying to address.

Video-to-Audio Generation
Loudness Control
Multi-modal Information Fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tri-Ergon Method
Fine-grained Loudness Control
High-Fidelity Stereo Audio
Bingliang Li
vivo Mobile Communication Co., Ltd; The Chinese University of Hong Kong, Shenzhen
Fengyu Yang
The Chinese University of Hong Kong, Shenzhen
Yuxin Mao
Northwestern Polytechnical University
Qingwen Ye
vivo Mobile Communication Co., Ltd
Hongkai Chen
vivo Mobile Communication Co., Ltd
Yiran Zhong
PhD, Australian National University
LLM, Self-supervised Learning, Visual Geometry Learning, Natural Language Processing, Multimodal