Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Do current video diffusion models genuinely understand the 3D structure underlying visual observations, or do they merely generate plausible 2D projections? This work proposes a 3D-aware human motion control framework that skips 2D rendering: it tokenizes compressed 3D human meshes directly and models them jointly with video tokens in a DiT architecture. Because the mesh tokens preserve full 3D geometric information, the model must reason jointly about appearance, structure, and camera viewpoint while coherently generating geometry, motion, and scene context. The method reports strong performance on human motion control benchmarks and reduces the editing artifacts that arise when 2D guidance is view-dependent or when trajectories and poses are mismatched.

📝 Abstract

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modeling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks and a reduction in the artifacts induced by view-dependent 2D guidance and by trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.
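
The abstract does not give architectural details, so the following is only a minimal sketch of what processing video tokens jointly with motion tokens could look like: the two token sets are concatenated into one sequence and passed through a single self-attention step, so every video token can attend to every mesh token. Everything here is an assumption for illustration, not the authors' implementation: the function names, the single attention head, the token counts, the random stand-in embeddings, and the absence of positional embeddings and diffusion timestep conditioning.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(video_tokens, mesh_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated [mesh; video] sequence.

    Hypothetical illustration: a real DiT block would add multiple heads,
    normalization, an MLP, and timestep conditioning.
    """
    tokens = mesh_tokens + video_tokens  # one sequence of length M + V
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    out = matmul(weights, v)
    # Return only the updated video tokens, plus the full attention map.
    return out[len(mesh_tokens):], weights


def random_matrix(rows, cols, rng):
    """Random stand-in for learned embeddings or projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 8                                      # assumed token width
video_tokens = random_matrix(6, d, rng)    # stand-in for video latent tokens
mesh_tokens = random_matrix(3, d, rng)     # stand-in for compressed mesh tokens
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))

video_out, attn = joint_self_attention(video_tokens, mesh_tokens, w_q, w_k, w_v)
```

In this sketch the first three columns of each row of `attn` hold the attention a token pays to the mesh tokens, which is where 3D conditioning would enter without any rendered 2D guidance video.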

Problem

Research questions and friction points this paper is trying to address.

3D-awareness

video diffusion models

human motion control

3D human mesh

render-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware diffusion

mesh tokenization

render-free control