AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two obstacles in text-driven 3D mesh animation: the complexity of spatio-temporal modeling and the scarcity of high-quality 4D training data. The authors propose AnimateAnyMesh, the first feed-forward framework for text-driven animation of arbitrary 3D meshes. Its core component, DyMeshVAE, compresses and reconstructs dynamic mesh sequences efficiently by disentangling spatial and temporal representations while explicitly preserving mesh topology. To support training, the authors introduce the DyMesh Dataset, the first large-scale, text-annotated dynamic mesh dataset, comprising over 4 million sequences. Text-conditional generation is performed with Rectified Flow-based diffusion in the compressed latent space, combined with a text-mesh cross-modal alignment strategy. The method produces semantically accurate, temporally coherent, high-fidelity mesh animations within seconds, outperforming state-of-the-art methods in generation quality, inference speed, and generalization.

📝 Abstract
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be openly released.
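The abstract describes disentangling spatial and temporal features of a dynamic mesh while keeping its topology fixed. The paper's actual DyMeshVAE architecture is not reproduced here; the following is only a minimal sketch of the general idea, assuming the temporal component is expressed as per-frame vertex displacements relative to the first frame (the face/connectivity data never changes, so topology is preserved by construction):

```python
import numpy as np

def disentangle(vertex_seq):
    """Split a dynamic mesh into a static shape and per-frame motion.

    vertex_seq: (T, V, 3) vertex positions over T frames.
    Faces (topology) are not touched, so connectivity is preserved.
    Returns the first-frame geometry and displacement trajectories.
    """
    shape = vertex_seq[0]                  # spatial component: rest geometry
    motion = vertex_seq - shape[None]      # temporal component: displacements
    return shape, motion

def reassemble(shape, motion):
    """Inverse operation: recover the full vertex sequence."""
    return shape[None] + motion

# toy example: 2 frames, 4 vertices
seq = np.stack([np.zeros((4, 3)), np.ones((4, 3))])
shape, motion = disentangle(seq)
```

In this toy factorization the first frame's motion is identically zero, and `reassemble` recovers the original sequence exactly; the real architecture learns compressed latent codes for both components rather than storing raw displacements.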
Problem

Research questions and friction points this paper is trying to address.

Creating high-quality animated 3D models is challenging
Existing methods struggle with spatio-temporal distribution complexity
Scarcity of 4D training data limits animation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward framework for text-driven mesh animation
DyMeshVAE architecture disentangles spatial-temporal features
Rectified Flow-based training in compressed latent space
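The last point refers to Rectified Flow training in the VAE's compressed latent space. A minimal sketch of the standard rectified-flow objective, with a placeholder `model` and synthetic latents standing in for the paper's actual text-conditioned denoiser and DyMeshVAE codes (the interpolation path is a straight line from Gaussian noise to the clean latent, and the network regresses the constant velocity along it):

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(model, z1, text_emb, rng):
    """One rectified-flow training target in latent space.

    z1: clean latent codes from the VAE encoder, shape (batch, dim).
    """
    z0 = rng.standard_normal(z1.shape)       # Gaussian source sample
    t = rng.uniform(size=(z1.shape[0], 1))   # per-sample time in [0, 1]
    zt = t * z1 + (1.0 - t) * z0             # straight-line interpolation
    v_target = z1 - z0                       # constant velocity along the line
    v_pred = model(zt, t, text_emb)
    return np.mean((v_pred - v_target) ** 2)

# toy "model": ignores its conditioning and predicts zero velocity
dummy = lambda zt, t, c: np.zeros_like(zt)
z1 = rng.standard_normal((4, 8))
loss = rectified_flow_loss(dummy, z1, None, rng)
```

Because the velocity field is (near-)straight, sampling at inference can take very few integration steps, which is consistent with the paper's claim of generating animations in seconds.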