🤖 AI Summary
Existing video depth estimation methods rely on affine-invariant predictions, which compromises geometric fidelity and limits their use in metrically grounded tasks such as 3D/4D reconstruction and camera parameter estimation. To address this, we propose GeometryCrafter, a framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos. Our approach introduces a point map VAE whose latent space is agnostic to video latent distributions, and pairs it with a video diffusion model that captures the distribution of point map sequences conditioned on the input video. Evaluated on diverse datasets, our method achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
📝 Abstract
Despite remarkable advances in video depth estimation, existing methods are inherently limited in geometric fidelity because they rely on affine-invariant predictions, which restricts their applicability to reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
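To make the limitation of affine-invariant predictions concrete, the following is a minimal NumPy sketch and not the GeometryCrafter implementation. It assumes a simple pinhole camera with made-up intrinsics and a synthetic slanted plane. The helper names (`unproject`, `planarity_residual`, `focal_from_point_map`) are hypothetical and chosen only for illustration. The sketch lifts depth to a point map, shows that an unknown depth shift bends a plane while a pure scale does not, and recovers the focal length from a point map by least squares.

```python
import numpy as np


def unproject(depth, f, cx, cy):
    """Lift a depth map (H, W) to a point map (H, W, 3) with a pinhole camera."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / f * depth
    y = (v - cy) / f * depth
    return np.stack([x, y, depth], axis=-1)


def planarity_residual(points):
    """Smallest / largest singular value of the centered points (0 for a plane)."""
    pts = points.reshape(-1, 3)
    centered = pts - pts.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return float(s[-1] / s[0])


def focal_from_point_map(points, cx):
    """Least-squares focal length from the pinhole relation u - cx = f * x / z."""
    w = points.shape[1]
    du = np.broadcast_to(np.arange(w) - cx, points.shape[:2])
    ratio = points[..., 0] / points[..., 2]
    return float((du * ratio).sum() / (ratio * ratio).sum())


# Hypothetical intrinsics and a synthetic slanted plane n . X = d.
H, W, F, CX, CY = 24, 32, 50.0, 16.0, 12.0
u, v = np.meshgrid(np.arange(W), np.arange(H))
rays = np.stack([(u - CX) / F, (v - CY) / F, np.ones((H, W))], axis=-1)
normal, offset = np.array([0.5, 0.0, 1.0]), 2.0
true_depth = offset / (rays @ normal)

metric_points = unproject(true_depth, F, CX, CY)
scaled_points = unproject(3.0 * true_depth, F, CX, CY)   # unknown scale only
shifted_points = unproject(true_depth + 1.0, F, CX, CY)  # unknown shift

res_metric = planarity_residual(metric_points)
res_scaled = planarity_residual(scaled_points)
res_shifted = planarity_residual(shifted_points)
focal_est = focal_from_point_map(metric_points, CX)

print("planarity residual (metric, scaled, shifted):",
      res_metric, res_scaled, res_shifted)
print("focal length recovered from the point map:", focal_est)
```

In this toy setup the metric and scaled point maps stay planar up to floating-point error, whereas the shifted one does not. This illustrates why depth known only up to scale and shift cannot be lifted to faithful 3D geometry, and why predicting point maps directly is attractive for reconstruction and camera parameter estimation.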