ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address structural misalignment between audio and video modalities and the high computational cost of multimodal generation, this paper proposes an efficient synchronized audio-video generation framework. The method employs a latent-space diffusion Transformer for end-to-end generation. Key contributions include: (1) a multi-scale dual-stream spatiotemporal autoencoder that maps audio to video-like representations, establishing a unified 3D latent space; (2) an orthogonal decomposition mechanism to disentangle modality-specific dynamics from shared temporal dynamics; and (3) a hybrid attention architecture integrating multi-scale temporal self-attention and grouped cross-modal attention to enhance temporal coherence and fine-grained inter-modal interaction. Evaluated on multiple standard benchmarks, the approach achieves state-of-the-art fidelity in synchronized audio-video generation while reducing computational cost by approximately 37% in FLOPs.
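As a rough illustration of contribution (2), the sketch below shows one plausible way to realize the orthogonal decomposition: project each modality's latent onto a learned shared subspace and keep the orthogonal residual as the modality-specific part. The projection-based formulation, class name, and dimensions are assumptions for illustration only, not the paper's implementation.

```python
# Hedged sketch of orthogonal decomposition of shared vs. modality-specific
# dynamics. All names and shapes are assumed, not taken from the paper's code.
import torch
import torch.nn as nn


class OrthogonalDecomposition(nn.Module):
    def __init__(self, dim: int, shared_dim: int):
        super().__init__()
        # Learned basis of the shared subspace; orthonormalized via QR at use time.
        self.basis = nn.Parameter(torch.randn(dim, shared_dim))

    def forward(self, x: torch.Tensor):
        # x: (..., dim) latent tokens from one modality
        q, _ = torch.linalg.qr(self.basis)        # orthonormal columns (dim, shared_dim)
        shared = (x @ q) @ q.transpose(-2, -1)    # projection onto the shared subspace
        specific = x - shared                     # orthogonal residual (modality-specific)
        return shared, specific


if __name__ == "__main__":
    dec = OrthogonalDecomposition(dim=64, shared_dim=16)
    z = torch.randn(2, 10, 64)
    s, p = dec(z)
    # Shared and specific parts are orthogonal by construction (up to float error).
    print((s * p).sum(-1).abs().max())
```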

📝 Abstract
Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
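The abstract's two structural ideas, converting raw audio into video-like frames and stacking per-frame 2D latents into a single 3D latent volume for the spatio-temporal diffusion Transformer, can be sketched as follows. The mel-spectrogram front end, tensor shapes, and function names are assumptions; the paper only specifies that audio is aligned with video along the temporal and spatial dimensions.

```python
# Minimal sketch of (a) turning raw audio into video-like spectrogram frames and
# (b) stacking per-frame 2D latents of both modalities into one 3D latent tensor.
# The mel-spectrogram choice and every shape below are illustrative assumptions.
import torch
import torchaudio


def audio_to_frames(wave: torch.Tensor, sample_rate: int, num_frames: int,
                    n_mels: int = 128) -> torch.Tensor:
    """Convert a mono waveform (1, samples) into (num_frames, 1, n_mels, w)
    spectrogram patches, one patch per video frame."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate, n_mels=n_mels)(wave)
    mel = mel.squeeze(0)                       # (n_mels, time)
    t = mel.shape[-1] - mel.shape[-1] % num_frames
    mel = mel[:, :t]                           # trim so time splits evenly
    chunks = mel.split(t // num_frames, dim=-1)
    return torch.stack(chunks).unsqueeze(1)    # (num_frames, 1, n_mels, w)


def stack_latents(video_lat: torch.Tensor, audio_lat: torch.Tensor) -> torch.Tensor:
    """Stack per-frame 2D latents of both modalities along a new modality axis,
    yielding one 3D latent volume for the diffusion Transformer."""
    # video_lat, audio_lat: (frames, channels, height, width), same shape
    return torch.stack([video_lat, audio_lat], dim=1)  # (frames, 2, C, H, W)


if __name__ == "__main__":
    wave = torch.randn(1, 16000)                        # 1 s of synthetic audio
    frames = audio_to_frames(wave, sample_rate=16000, num_frames=8)
    print(frames.shape)                                 # (8, 1, 128, w)
    v = torch.randn(8, 4, 32, 32)
    a = torch.randn(8, 4, 32, 32)
    print(stack_latents(v, a).shape)                    # (8, 2, 4, 32, 32)
```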
Problem

Research questions and friction points this paper is trying to address.

Addresses structural misalignment between audio and video modalities
Reduces high computational cost of multimodal data processing
Enhances temporal coherence in synchronized audio-video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projected latent diffusion transformer for synchronized audio-video generation
Multi-scale dual-stream autoencoder with orthogonal latent decomposition
Multi-scale attention mechanism with spatio-temporal diffusion modeling
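A minimal sketch of the group cross-modal attention idea listed above, assuming groups are formed along a shared temporal axis so that video queries attend only to temporally nearby audio tokens; the class name, arguments, and shapes are hypothetical.

```python
# Hedged sketch of grouped cross-modal attention. Names and shapes are assumed,
# not taken from the paper's released code.
import torch
import torch.nn as nn


class GroupedCrossModalAttention(nn.Module):
    """Cross-attention between video and audio tokens, restricted to temporally
    aligned groups so that inter-modal interaction stays local in time."""

    def __init__(self, dim: int, num_heads: int = 8, group_size: int = 4):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video, audio: (batch, time, dim) with matching temporal length
        b, t, d = video.shape
        g = self.group_size
        assert t % g == 0, "temporal length must be divisible by group size"
        # Fold temporal groups into the batch so attention only mixes tokens
        # inside the same temporal group.
        v = video.reshape(b * t // g, g, d)
        a = audio.reshape(b * t // g, g, d)
        # Video queries attend to audio keys/values within each group.
        fused, _ = self.attn(query=v, key=a, value=a)
        return fused.reshape(b, t, d)


if __name__ == "__main__":
    layer = GroupedCrossModalAttention(dim=64, group_size=4)
    v = torch.randn(2, 16, 64)
    a = torch.randn(2, 16, 64)
    print(layer(v, a).shape)  # torch.Size([2, 16, 64])
```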
Authors

Jiahui Sun
Shanghai Jiao Tong University
Weining Wang
Institute of Automation, Chinese Academy of Sciences
Mingzhen Sun
Institute of Automation, Chinese Academy of Sciences
Yirong Yang
Beihang University
Xinxin Zhu
Institute of Automation, Chinese Academy of Sciences
Jing Liu
Institute of Automation, Chinese Academy of Sciences