🤖 AI Summary
To address the challenge of generating semantically consistent and temporally aligned Foley sounds from silent videos, this paper proposes a video-text joint-driven controllable sound synthesis method. To overcome the limitations of implicit acoustic priors in visual features, we introduce, for the first time, a lightweight modality adapter that fuses dual guidance from video and text inputs, enhanced by contrastive learning and alignment loss to optimize cross-modal semantic consistency. Our approach ensures precise audiovisual synchronization (temporal error < 0.12 s) while significantly improving text controllability—achieving a 37% higher instruction-following rate than the state of the art—and enabling fine-grained editing of sound attributes. Comprehensive subjective and objective evaluations demonstrate a Mean Opinion Score (MOS) of 4.21, confirming superior overall performance over current best methods.
📝 Abstract
Foley is a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advancements in automatic video-to-audio models have increased the demand for greater user control in the process. One possible approach is to incorporate text to guide audio generation. While supported by existing methods, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist) a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that besides its superior quality in terms of semantic alignment and audio-visual synchronization the proposed method enable high textual controllability as demonstrated in subjective and objective evaluations.