🤖 AI Summary
This work addresses a central difficulty in joint audio-video generation: semantic alignment, perceptual quality, and audio-visual synchronization are hard to optimize simultaneously, and improving them usually demands substantial training resources. The study presents the first comprehensive investigation of inference-time scaling (ITS) for joint audio-video generation and proposes a training-free multi-objective framework. The method combines guidance from multiple verifiers with an Adaptive Reward Weighting (ARW) mechanism, which aggregates reward signals online during inference to achieve balanced improvements across objectives. Experiments on VGGSound and JavisBench-mini show that the approach significantly improves semantic alignment, audio-visual synchronization, and perceptual quality of the generated content without any additional training.
📝 Abstract
Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, using learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on the VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances the semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.
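The abstract does not spell out ARW's update rule, so the following is only a minimal sketch of the general idea it describes: candidates are scored by several verifiers, and learnable per-verifier scales are adjusted online so that no single reward dominates the selection merely because of its numeric range. The class `AdaptiveRewardScaler`, the function `select_best`, the log-variance loss, and the toy scores are all assumptions made for illustration and should not be read as the authors' algorithm.

```python
import math


class AdaptiveRewardScaler:
    """Hypothetical calibrator: one learnable log-scale per verifier."""

    def __init__(self, num_verifiers, lr=0.1):
        self.log_scale = [0.0] * num_verifiers  # learnable parameters
        self.lr = lr

    def update(self, rewards):
        """One online gradient step on a batch: rows are candidates, columns verifiers."""
        n = len(rewards)
        for k in range(len(self.log_scale)):
            column = [row[k] for row in rewards]
            mean = sum(column) / n
            var = sum((x - mean) ** 2 for x in column) / n
            if var <= 1e-12:
                continue  # a constant verifier carries no ranking signal
            # Assumed loss: 0.5 * log(scaled variance)^2, minimal when the
            # scaled reward has unit variance.
            log_scaled_var = 2.0 * self.log_scale[k] + math.log(var)
            self.log_scale[k] -= self.lr * 2.0 * log_scaled_var

    def aggregate(self, row):
        """Weighted sum of one candidate's rewards under the current scales."""
        return sum(math.exp(s) * r for s, r in zip(self.log_scale, row))


def select_best(rewards, scaler=None):
    """Return the index of the candidate with the highest aggregated reward."""
    if scaler is None:
        scores = [sum(row) for row in rewards]  # naive unweighted sum
    else:
        scores = [scaler.aggregate(row) for row in rewards]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores for three candidates from two verifiers on very different scales
# (made-up numbers, not results from the paper).
rewards = [[100.0, 0.0], [90.0, 1.0], [0.0, 0.5]]

scaler = AdaptiveRewardScaler(num_verifiers=2)
for _ in range(50):
    scaler.update(rewards)

naive_choice = select_best(rewards)
calibrated_choice = select_best(rewards, scaler)
```

In this toy setup the unweighted sum follows the large-range verifier alone, while the calibrated aggregation prefers the candidate that scores well on both. No prior statistics about either verifier are supplied; the scales are learned from the observed rewards only.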