Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 4D generation methods face two key limitations: implicit representations suffer from low rendering efficiency and incompatibility with rasterization engines, while skeleton-based approaches rely on manual rigging and exhibit poor generalization. To address the challenge of generating arbitrary 3D mesh animations from monocular video input, this paper proposes the first 4D latent diffusion framework that jointly models 3D shape and spatiotemporal motion. Our contributions are threefold: (1) a Transformer-based VQVAE learns category-agnostic mesh latent codes, enabling zero-shot driving; (2) spatiotemporal diffusion over point-cloud trajectory sequences achieves millisecond-level synthesis of high-fidelity animated meshes; and (3) explicit triangular meshes are generated natively compatible with mainstream real-time rendering engines. Experiments demonstrate significant improvements over implicit and skeletal baselines under complex motions, achieving superior efficiency, generalizability, and practical deployability.

Technology Category

Application Category

📝 Abstract
We propose DriveAnyMesh, a method for driving mesh guided by monocular video. Current 4D generation techniques encounter challenges with modern rendering engines. Implicit methods have low rendering efficiency and are unfriendly to rasterization-based engines, while skeletal methods demand significant manual effort and lack cross-category generalization. Animating existing 3D assets, instead of creating 4D assets from scratch, demands a deep understanding of the input's 3D structure. To tackle these challenges, we present a 4D diffusion model that denoises sequences of latent sets, which are then decoded to produce mesh animations from point cloud trajectory sequences. These latent sets leverage a transformer-based variational autoencoder, simultaneously capturing 3D shape and motion information. By employing a spatiotemporal, transformer-based diffusion model, information is exchanged across multiple latent frames, enhancing the efficiency and generalization of the generated results. Our experimental results demonstrate that DriveAnyMesh can rapidly produce high-quality animations for complex motions and is compatible with modern rendering engines. This method holds potential for applications in both the gaming and filming industries.
Problem

Research questions and friction points this paper is trying to address.

Enables mesh deformation from monocular video input
Improves rendering efficiency for modern engines
Generalizes across categories without manual effort
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D diffusion model denoises latent sets
Transformer VAE captures shape and motion
Spatiotemporal diffusion enhances efficiency
🔎 Similar Papers
No similar papers found.
Yahao Shi
Yahao Shi
Beihang University
computer graphics
Y
Yang Liu
State Key Laboratory of Software Development Environment, Beihang University
Y
Yanmin Wu
School of Electronic and Computer Engineering, Peking University
X
Xing Liu
Baidu VIS
C
Chen Zhao
Baidu VIS
J
Jie Luo
State Key Laboratory of Software Development Environment, Beihang University
B
Bin Zhou
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University