Terra: Explorable Native 3D World Model with Point Latents

Date: 2025-10-16
Citations: 0
Influential: 0
AI Summary
Existing world models predominantly rely on pixel-aligned representations and neglect the intrinsic 3D nature of physical scenes, which leads to poor 3D consistency and inefficient modeling. This work introduces Terra, a native 3D world model that builds a point-based latent space jointly encoding geometry and appearance, so that a single generation pass supports multi-view-consistent rendering from arbitrary viewpoints. Key contributions include: (i) the point-to-Gaussian variational autoencoder (P2G-VAE), which encodes 3D inputs into latent points and decodes them as 3D Gaussian primitives; (ii) the sparse point flow matching network (SPFlow), which generates point latents by jointly denoising their positions and features and supports progressive generation in the point latent space; and (iii) the combination of 3D Gaussian primitives with flow matching for generative rendering. Evaluated on indoor scenes from ScanNet v2, Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.

๐Ÿ“ Abstract
World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Terra enables exact multi-view consistency through its native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
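The abstract says SPFlow denoises point positions and features together but gives no formulas. As a rough illustration only, the sketch below uses the generic conditional flow matching recipe with a straight-line path between noise and data. It is not taken from the paper. The latent layout (xyz concatenated with feature channels) and the names `flow_matching_pair`, `euler_sample`, and `velocity_fn` are assumptions made for this example.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build one training example for straight-line flow matching.

    x1: (N, 3 + C) clean point latents. The layout (xyz followed by C
        feature channels) is an assumption for this sketch.
    Returns the noisy latents x_t, the sampled time t, and the velocity
    target that a network would regress onto.
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise for positions and features alike
    t = rng.uniform()                    # one time value in [0, 1) shared by all points
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path from noise to data
    v_target = x1 - x0                   # constant velocity along that path
    return x_t, t, v_target


def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.

    velocity_fn stands in for a trained network (hypothetical here).
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

In this sketch the positions and features share one noise level and one velocity field, which is one simple way to read "simultaneously denoises"; the actual SPFlow design may differ.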
Problem

Research questions and friction points this paper is trying to address.

Addressing 3D inconsistency in pixel-aligned world models
Developing native 3D representation using point latent space
Creating explorable environments with multi-view consistent generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native 3D world model with point latent representation
Point-to-Gaussian VAE encoding geometry and appearance
Sparse point flow matching network generating point latents
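To make the point-to-Gaussian idea more concrete, here is a minimal sketch of two standard ingredients: the VAE reparameterization trick, and a mapping from per-point features to 3D Gaussian parameters using the usual activations (exponential for scale, normalization for the rotation quaternion, sigmoid for opacity and color). The linear head `weight`, the 14-channel layout, and the function names are hypothetical placeholders. The summary does not describe the actual P2G-VAE decoder.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, the standard VAE reparameterization."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_to_gaussians(positions, features, weight):
    """Map point latents to Gaussian primitive parameters.

    positions: (N, 3) latent point coordinates.
    features:  (N, C) latent point features.
    weight:    (C, 14) hypothetical linear head standing in for a real decoder.
    The 14 outputs are split as: 3 offset, 3 log-scale, 4 quaternion,
    1 opacity logit, 3 color logits (an assumed layout).
    """
    raw = features @ weight
    offset, log_scale, quat, opacity_logit, rgb_logit = np.split(
        raw, [3, 6, 10, 11], axis=1
    )
    quat_norm = np.maximum(np.linalg.norm(quat, axis=1, keepdims=True), 1e-8)
    return {
        "mean": positions + offset,           # Gaussian centers near the latent points
        "scale": np.exp(log_scale),           # strictly positive axis scales
        "rotation": quat / quat_norm,         # unit quaternions
        "opacity": sigmoid(opacity_logit),    # values in (0, 1)
        "color": sigmoid(rgb_logit),          # RGB in (0, 1)
    }
```

Each latent point yields exactly one Gaussian in this sketch; a real decoder could just as well emit several primitives per point.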