UniForm: A Unified Diffusion Transformer for Audio-Video Generation

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-video diffusion methods model each modality with separate modules, neglecting cross-modal semantic alignment and temporal dependencies, which leads to poor synchronization and weak inter-modal consistency. To address this, we propose UniForm, a unified diffusion transformer that, for the first time, uses a single shared-weight architecture operating in a joint latent space to model audio and video synchronously. UniForm achieves strong cross-modal coupling via modality concatenation and bidirectional conditional guidance (audio ↔ video), avoiding modality decoupling and directly optimizing the joint distribution. This design significantly improves generation quality, lip-sync accuracy, and fine-grained detail fidelity. UniForm establishes new state-of-the-art performance across three tasks: joint audio-video generation, audio-conditioned video generation, and video-conditioned audio generation. By unifying multimodal representation and generation within a single diffusion framework, UniForm introduces a novel paradigm for multimodal diffusion modeling.
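
To make the core idea concrete, here is a minimal, self-contained sketch (not the authors' code) of a shared-weight diffusion transformer that denoises audio and video latents as one joint token sequence. All module names, dimensions, and the noise-prediction heads are illustrative assumptions:

```python
# A minimal sketch of the UniForm idea: one shared-weight transformer denoises
# audio and video latents concatenated along the token axis. Every name,
# dimension, and design detail below is an assumption for illustration.
import torch
import torch.nn as nn


class UniFormSketch(nn.Module):
    def __init__(self, dim=512, depth=8, heads=8, audio_dim=128, video_dim=256):
        super().__init__()
        # Per-modality projections map each latent into one shared space.
        self.audio_in = nn.Linear(audio_dim, dim)
        self.video_in = nn.Linear(video_dim, dim)
        # Learned modality embeddings mark which tokens are audio vs. video.
        self.modality_emb = nn.Embedding(2, dim)  # 0 = audio, 1 = video
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        # One shared-weight backbone attends over both modalities jointly,
        # so cross-modal dependencies are modeled inside every attention layer.
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.audio_out = nn.Linear(dim, audio_dim)
        self.video_out = nn.Linear(dim, video_dim)

    def forward(self, audio_lat, video_lat, t):
        # audio_lat: (B, Ta, audio_dim), video_lat: (B, Tv, video_dim), t: (B,)
        a = self.audio_in(audio_lat) + self.modality_emb.weight[0]
        v = self.video_in(video_lat) + self.modality_emb.weight[1]
        x = torch.cat([a, v], dim=1)                        # concatenate along tokens
        x = x + self.time_emb(t.float()[:, None])[:, None]  # broadcast timestep embedding
        x = self.backbone(x)
        Ta = audio_lat.shape[1]
        # Predict per-modality noise from the jointly attended tokens.
        return self.audio_out(x[:, :Ta]), self.video_out(x[:, Ta:])
```

Because both modalities pass through the same attention layers, every layer can exchange information between audio and video tokens, which is the mechanism the summary credits for improved synchronization and inter-modal consistency.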

📝 Abstract
As a naturally multimodal form of content, audible video delivers an immersive sensory experience, so audio-video generation systems hold substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules to generate each modality, leaving shared-weight generative modules largely unexplored. This approach may under-exploit the intrinsic correlations between the audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method on joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.
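
The three tasks mentioned in the abstract could plausibly share one backbone by choosing which modality is noised during training. The following hedged sketch (the helper, the toy noising schedule, and the `model` signature are illustrative assumptions, not the paper's training recipe) keeps the conditioning modality's latents clean and computes the diffusion loss only on the generated modality:

```python
# A hedged illustration of serving all three tasks with one model: noise only
# the target modality and keep the conditioning modality clean. The scheme,
# schedule, and names are assumptions, not the paper's actual procedure.
import torch

def training_loss(model, audio_lat, video_lat, task):
    """task: 'joint', 'a2v' (audio-conditioned video), or 'v2a'."""
    B = audio_lat.shape[0]
    t = torch.randint(0, 1000, (B,), device=audio_lat.device)
    noise_a, noise_v = torch.randn_like(audio_lat), torch.randn_like(video_lat)
    # Toy linear noising schedule, for illustration only.
    alpha = 1.0 - t.float().view(B, 1, 1) / 1000.0
    noisy_a = alpha.sqrt() * audio_lat + (1 - alpha).sqrt() * noise_a
    noisy_v = alpha.sqrt() * video_lat + (1 - alpha).sqrt() * noise_v
    # The conditioning modality stays clean; the target modality is denoised.
    in_a = audio_lat if task == "a2v" else noisy_a
    in_v = video_lat if task == "v2a" else noisy_v
    pred_a, pred_v = model(in_a, in_v, t)
    loss = 0.0
    if task in ("joint", "v2a"):
        loss = loss + torch.mean((pred_a - noise_a) ** 2)
    if task in ("joint", "a2v"):
        loss = loss + torch.mean((pred_v - noise_v) ** 2)
    return loss
```

Under this reading, joint generation denoises both streams at once, while the two conditional tasks reduce to the same objective with one stream held fixed, which is one way a single set of weights could cover all three settings.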
Problem

Research questions and friction points this paper is trying to address.

Unified audio-video generation
Cross-modal consistency enhancement
Shared-weight generative modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion transformer model
Simultaneous audio-video generation
Shared-weight generative modules
👥 Authors
Lei Zhao
Institute of Artificial Intelligence (TeleAI), China Telecom
Linfeng Feng
Northwestern Polytechnical University
Speech Processing, Multimodal Learning
Dongxu Ge
Institute of Artificial Intelligence (TeleAI), China Telecom
Fangqiu Yi
Institute of Artificial Intelligence (TeleAI), China Telecom
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Xiao-Lei Zhang
Professor, Northwestern Polytechnical University, China
Speech Processing, Machine Learning, Signal Processing
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom