World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image-to-3D methods struggle to achieve pixel-aligned fidelity and complete scene geometry at the same time. This work proposes World Tracing, a generative, pixel-aligned geometry representation that predicts, for each input pixel, an ordered stack of 3D points in camera space: the first layer corresponds to the visible surface, and subsequent layers capture front-to-back intersections with occluded surfaces. The representation combines complete geometry generation with pixel alignment and preserves 2D-to-3D correspondence, which enables text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators. The model, a World Tracing Diffusion Transformer (WT-DiT), treats the geometry layers as separate denoising tokens coupled through factorized and global attention, and is trained with pixel-space flow matching and a mixed noise schedule. Experiments on object, scene, and dynamic benchmarks report better results than depth predictors and image-to-3D generators on both visible-surface reconstruction and complete geometry generation.
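To make the per-pixel "ordered stack of camera-space 3D points" concrete, here is a minimal NumPy sketch of such a layered point map. It is an illustration only, not the paper's data format: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the use of NaN to mark rays with no further hit, the integer pixel-coordinate convention, and the helper names `unproject_layers` and `is_front_to_back` are all assumptions made for this example.

```python
import numpy as np


def unproject_layers(depth_layers, fx, fy, cx, cy):
    """Turn per-pixel depth layers into an ordered stack of camera-space points.

    depth_layers: (L, H, W) z-depths per layer; NaN means "no surface hit"
                  (an assumed convention for this sketch).
    Returns points of shape (L, H, W, 3) and a boolean valid mask (L, H, W).
    Assumes a pinhole camera with pixel coordinates taken at integer indices.
    """
    _, H, W = depth_layers.shape
    v, u = np.mgrid[0:H, 0:W]                      # row (v) and column (u) indices
    x = (u - cx) / fx                              # ray direction at z = 1
    y = (v - cy) / fy
    rays = np.stack([x, y, np.ones_like(x)], axis=-1)  # (H, W, 3)

    valid = np.isfinite(depth_layers)
    z = np.where(valid, depth_layers, 0.0)         # zero out missing layers
    points = rays[None] * z[..., None]             # every layer stays on its pixel's ray
    return points, valid


def is_front_to_back(points, valid):
    """Check that consecutive valid layers never move closer to the camera."""
    z = points[..., 2]
    both = valid[1:] & valid[:-1]                  # pairs where both layers exist
    return bool(np.all(z[1:][both] >= z[:-1][both]))
```

Because every layer of a pixel lies on the same camera ray, layer 0 plays the role of an ordinary depth map, while deeper layers add occluded surfaces without losing the pixel index that ties each 3D point back to the image.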

📝 Abstract

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
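The training objective named above, flow matching, can be sketched generically as follows. This is the standard linear-interpolation formulation, not WT-DiT's actual training code: the tensor layout, the uniform sampling of `t`, and the function names `flow_matching_pair` and `flow_matching_loss` are assumptions for illustration, and the paper's mixed noise schedule is not reproduced because its details are not given here.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build one flow-matching training example from clean data x1.

    x1: clean targets, e.g. layered point maps of shape (B, L, H, W, 3)
        (an assumed layout for this sketch).
    Returns the noisy input x_t, the time t, and the velocity target.
    """
    x0 = rng.standard_normal(x1.shape)             # pure Gaussian noise sample
    # One time value per batch element, broadcast over the remaining axes.
    # Uniform sampling is a placeholder, not the paper's mixed schedule.
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1                   # straight path from noise to data
    v_target = x1 - x0                             # constant velocity along that path
    return xt, t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))
```

In a real setup, a network would take `xt`, `t`, and the input image and output `v_pred`; sampling then integrates the predicted velocity from noise at `t = 0` to geometry at `t = 1`.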

Problem

Research questions and friction points this paper is trying to address.

pixel-aligned geometry

image-to-3D

occluded geometry completion

visible surface reconstruction

3D generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

World Tracing

pixel-aligned geometry

occluded surface completion