Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

Existing 4D generation methods face two key limitations: implicit representations suffer from low rendering efficiency and incompatibility with rasterization engines, while skeleton-based approaches rely on manual rigging and exhibit poor generalization. To address the challenge of generating arbitrary 3D mesh animations from monocular video input, this paper proposes the first 4D latent diffusion framework that jointly models 3D shape and spatiotemporal motion. Our contributions are threefold: (1) a Transformer-based VQVAE learns category-agnostic mesh latent codes, enabling zero-shot driving; (2) spatiotemporal diffusion over point-cloud trajectory sequences achieves millisecond-level synthesis of high-fidelity animated meshes; and (3) explicit triangular meshes are generated natively compatible with mainstream real-time rendering engines. Experiments demonstrate significant improvements over implicit and skeletal baselines under complex motions, achieving superior efficiency, generalizability, and practical deployability.

Technology Category

Application Category

📝 Abstract

We propose DriveAnyMesh, a method for driving mesh guided by monocular video. Current 4D generation techniques encounter challenges with modern rendering engines. Implicit methods have low rendering efficiency and are unfriendly to rasterization-based engines, while skeletal methods demand significant manual effort and lack cross-category generalization. Animating existing 3D assets, instead of creating 4D assets from scratch, demands a deep understanding of the input's 3D structure. To tackle these challenges, we present a 4D diffusion model that denoises sequences of latent sets, which are then decoded to produce mesh animations from point cloud trajectory sequences. These latent sets leverage a transformer-based variational autoencoder, simultaneously capturing 3D shape and motion information. By employing a spatiotemporal, transformer-based diffusion model, information is exchanged across multiple latent frames, enhancing the efficiency and generalization of the generated results. Our experimental results demonstrate that DriveAnyMesh can rapidly produce high-quality animations for complex motions and is compatible with modern rendering engines. This method holds potential for applications in both the gaming and filming industries.

Problem

Research questions and friction points this paper is trying to address.

Enables mesh deformation from monocular video input

Improves rendering efficiency for modern engines

Generalizes across categories without manual effort

Innovation

Methods, ideas, or system contributions that make the work stand out.

4D diffusion model denoises latent sets

Transformer VAE captures shape and motion

Spatiotemporal diffusion enhances efficiency

🔎 Similar Papers

DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos