🤖 AI Summary
Existing methods for dynamic 3D generation are typically tied to a single 3D representation, struggle with topological changes, or lose temporal consistency over long videos. This work proposes MORPHOS, an autoregressive framework that generates dynamic 3D assets from videos as meshes, 3D Gaussians, or radiance fields. It introduces Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Dynamic geometry and appearance are generated frame by frame with causal attention, with each frame conditioned on its preceding history. To mitigate error accumulation in autoregressive generation, the method adds a temporal-structural augmentation. Experiments show that MORPHOS achieves state-of-the-art appearance quality and competitive geometry results across multiple benchmarks, generalizes across representations, and remains robust in long-horizon generation.
📝 Abstract
We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
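To make the phrase "conditioning each frame on its preceding history" concrete, the following is a minimal sketch of frame-level causal attention in NumPy. It is not the MORPHOS implementation: the function names `frame_causal_mask` and `causal_attention`, the fixed number of tokens per frame, and the choice to let tokens attend within their own frame are all assumptions made for illustration.

```python
import numpy as np


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumption: every frame contributes the same number of latent tokens,
    laid out frame after frame in one flattened sequence.
    """
    # Frame index of each token in the flattened sequence.
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A query may attend to keys in its own frame or in any earlier frame.
    return frame_ids[:, None] >= frame_ids[None, :]


def causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with masked positions removed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed (future-frame) positions get -inf so their weight becomes 0.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Usage: 3 frames with 2 latent tokens each, feature width 4 (toy sizes).
rng = np.random.default_rng(0)
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = causal_attention(q, k, v, mask)
```

With this mask, the outputs for earlier frames do not depend on later frames, which is the property that allows frames to be generated one at a time while reusing the history already produced.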