Terra: Explorable Native 3D World Model with Point Latents

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing world models predominantly rely on pixel-aligned representations and neglect the intrinsic 3D nature of physical scenes, which can undermine 3D consistency and modeling efficiency. This work introduces Terra, a native 3D world model that builds a point-based latent space jointly encoding geometry and appearance, so that a single generation pass supports multi-view-consistent rendering from arbitrary viewpoints. Key contributions include: (i) the Point-to-Gaussian Variational Autoencoder (P2G-VAE), which encodes 3D inputs into latent points and decodes them as 3D Gaussian primitives; (ii) the Sparse Point Flow Matching network (SPFlow), which generates the point latents by jointly denoising their positions and features; and (iii) progressive generation in the point latent space, which makes the generated world explorable. Evaluated on indoor scenes from ScanNet v2, Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.

📝 Abstract

World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
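
To make the idea of denoising positions and features together more concrete, here is a minimal NumPy sketch of generic flow matching applied to a set of latent points. It is not the SPFlow network: the abstract does not specify the interpolation path, the noise prior, or the sampler, so the linear path, the standard normal prior, the Euler integrator, and the names `make_flow_matching_pair` and `euler_sample` are all assumptions made for illustration.

```python
import numpy as np


def make_flow_matching_pair(positions, features, t, rng):
    """Build one (noisy input, velocity target) training pair.

    Assumed setup (not taken from the paper): a linear path between
    Gaussian noise (t = 0) and the data sample (t = 1).

    positions: (N, 3) latent point coordinates
    features:  (N, C) latent point features
    t:         scalar in [0, 1]
    """
    # Treat each point as one vector so positions and features share a path.
    x1 = np.concatenate([positions, features], axis=1)  # (N, 3 + C)
    x0 = rng.standard_normal(x1.shape)                  # noise sample
    xt = (1.0 - t) * x0 + t * x1                        # point on the path
    v_target = x1 - x0                                  # constant velocity along it
    return xt, v_target


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.standard_normal((1024, 3))
    feat = rng.standard_normal((1024, 16))
    xt, v = make_flow_matching_pair(pos, feat, t=0.3, rng=rng)
    print(xt.shape, v.shape)
```

In training, a network would be regressed onto `v_target` given `xt` and `t`; at generation time the learned network would take the place of `velocity_fn`, and the first three output columns would be read back as point positions.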

Problem

Research questions and friction points this paper is trying to address.

Addressing 3D inconsistency in pixel-aligned world models

Developing native 3D representation using point latent space

Creating explorable environments with multi-view consistent generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Native 3D world model with point latent representation

Point-to-Gaussian VAE encoding geometry and appearance

Sparse point flow matching network generating point latents
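
As a rough illustration of what decoding latent points into 3D Gaussian primitives can look like, the sketch below maps each latent point to one Gaussian through a single linear head. The summary only states that point latents are decoded as Gaussians that jointly model geometry and appearance; the 14-channel layout (offset, log-scale, quaternion, opacity, RGB), the one-Gaussian-per-point choice, and the function name `decode_points_to_gaussians` are assumptions borrowed from common Gaussian splatting conventions, not details confirmed for P2G-VAE.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_points_to_gaussians(positions, features, weight, bias):
    """Map N latent points to N Gaussian primitives (hypothetical decoder head).

    positions: (N, 3) latent point coordinates
    features:  (N, C) latent point features
    weight:    (C, 14) linear head, bias: (14,)
    """
    raw = features @ weight + bias  # (N, 14)
    # Column layout (assumed): 3 offset, 3 log-scale, 4 quaternion, 1 opacity, 3 color.
    offset, log_scale, quat, opacity_logit, color_logit = np.split(
        raw, [3, 6, 10, 11], axis=1
    )
    quat_norm = np.linalg.norm(quat, axis=1, keepdims=True)
    return {
        "mean": positions + offset,             # Gaussian centre near its latent point
        "scale": np.exp(log_scale),             # strictly positive axis lengths
        "rotation": quat / (quat_norm + 1e-8),  # (approximately) unit quaternion
        "opacity": sigmoid(opacity_logit),      # squashed to (0, 1)
        "color": sigmoid(color_logit),          # RGB squashed to (0, 1)
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gaussians = decode_points_to_gaussians(
        rng.standard_normal((1024, 3)),
        rng.standard_normal((1024, 16)),
        0.1 * rng.standard_normal((16, 14)),
        np.zeros(14),
    )
    print({name: value.shape for name, value in gaussians.items()})
```

Anchoring each Gaussian mean to its latent point keeps geometry explicit in the latent space, while the remaining channels carry appearance; a real decoder would likely be deeper and could emit several Gaussians per point.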

🔎 Similar Papers

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models