Wonderland: Navigating 3D Scenes from a Single Image

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 11 · Influential: 3
🤖 AI Summary
Existing methods for high-fidelity, large-scale 3D scene reconstruction typically depend on multi-view inputs and per-scene optimization, and they suffer from geometric distortions in occluded regions and blurry background textures. Method: We propose the first feed-forward 3D Gaussian Splatting prediction framework grounded in the latent space of video diffusion models, transferring video priors to single-image 3D reconstruction and enabling optimization-free, multi-view-free, real-time 3D generation. Our approach combines latent-space alignment, progressive training, and explicit 3D consistency constraints to jointly enhance geometric completeness and texture fidelity. Results: Our method achieves state-of-the-art performance across multiple benchmarks, demonstrates strong out-of-distribution generalization, and supports wide-baseline navigation and high-fidelity novel-view rendering.
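For context, the output representation here is a set of explicit 3D Gaussian primitives. The sketch below lists the parameters the feed-forward model must predict per Gaussian, following the standard 3DGS formulation; the dataclass itself is an illustrative container, not code from the paper.

```python
# Illustrative container for one 3D Gaussian primitive in the standard
# 3DGS formulation; a sketch for readers, not the paper's code.
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    position: tuple[float, float, float]          # center in world space
    scale: tuple[float, float, float]             # per-axis extent
    rotation: tuple[float, float, float, float]   # unit quaternion (w, x, y, z)
    opacity: float                                # alpha in [0, 1]
    color: tuple[float, float, float]             # RGB (SH coefficients in full 3DGS)


# Example: one small reddish splat one unit in front of the camera.
g = Gaussian3D(
    position=(0.0, 0.0, 1.0),
    scale=(0.05, 0.05, 0.05),
    rotation=(1.0, 0.0, 0.0, 0.0),
    opacity=0.9,
    color=(0.8, 0.2, 0.1),
)
```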

📝 Abstract
How can one efficiently generate high-quality, wide-scope 3D scenes from arbitrary single images? Existing methods suffer from several drawbacks, such as requiring multi-view data, time-consuming per-scene optimization, distorted geometry in occluded areas, and low visual quality in backgrounds. Our novel 3D scene reconstruction pipeline overcomes these limitations to tackle this challenge. Specifically, we introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. The video diffusion model is designed to create videos that precisely follow specified camera trajectories, allowing it to generate compressed video latents that encode multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets affirm that our model significantly outperforms existing single-view 3D scene generation methods, especially with out-of-domain images. Thus, we demonstrate for the first time that a 3D reconstruction model can effectively be built upon the latent space of a diffusion model to realize efficient 3D scene generation.
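As a rough sketch of how the pieces connect, the snippet below maps trajectory-conditioned video latents to per-pixel Gaussian parameters in one forward pass. Everything in it is an assumption made for illustration: the module name LatentGaussianRegressor, the 16-channel latent layout, the 14-parameter Gaussian output, and the toy Conv3d backbone all stand in for the paper's actual reconstruction model.

```python
# Minimal sketch of the feed-forward step: compressed video latents in,
# 3D Gaussian parameters out. Module name, channel counts, and the toy
# Conv3d backbone are illustrative assumptions only.
import torch
import torch.nn as nn


class LatentGaussianRegressor(nn.Module):
    """Maps a (B, C, T, H, W) video latent to per-pixel Gaussian
    parameters: 3 position + 3 scale + 4 rotation + 1 opacity + 3 color
    = 14 channels in this sketch."""

    def __init__(self, latent_dim: int = 16, gaussian_dim: int = 14):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(latent_dim, 128, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv3d(128, gaussian_dim, kernel_size=3, padding=1),
        )

    def forward(self, video_latents: torch.Tensor) -> torch.Tensor:
        # No per-scene optimization and no multi-view images: the latent
        # already encodes the camera trajectory's multi-view information.
        return self.backbone(video_latents)


latents = torch.randn(1, 16, 8, 32, 32)        # stand-in for diffusion latents
gaussians = LatentGaussianRegressor()(latents)
print(gaussians.shape)                         # torch.Size([1, 14, 8, 32, 32])
```

The key design choice this illustrates is operating directly on the compressed latent space rather than on decoded video frames, which is what makes single-pass, optimization-free reconstruction tractable.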
Problem

Research questions and friction points this paper is trying to address.

Generate high-quality, wide-scope 3D scenes from a single arbitrary image
Remove the reliance on multi-view data and time-consuming per-scene optimization
Achieve efficient, feed-forward wide-scope 3D reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages compressed latents from a camera-conditioned video diffusion model
Predicts 3D Gaussian Splattings of a scene in a single feed-forward pass
Trains with a progressive learning strategy for efficient, high-quality generation (see the schedule sketch below)
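The list above only names the progressive strategy; the sketch below shows one plausible coarse-to-fine schedule. The stage names, resolutions, view counts, and step budgets are assumptions chosen for illustration, not the paper's published recipe.

```python
# Hypothetical coarse-to-fine training schedule; all numbers are
# placeholders chosen for illustration.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    resolution: int   # rendered supervision resolution (pixels per side)
    num_views: int    # supervising views sampled along the trajectory
    steps: int        # optimizer steps spent in this stage


SCHEDULE = [
    Stage("coarse", resolution=128, num_views=4,  steps=50_000),
    Stage("mid",    resolution=256, num_views=8,  steps=50_000),
    Stage("fine",   resolution=512, num_views=16, steps=100_000),
]

for stage in SCHEDULE:
    # Earlier stages stabilize geometry cheaply; later stages add views
    # and resolution to refine texture fidelity and 3D consistency.
    print(f"{stage.name}: {stage.resolution}px, "
          f"{stage.num_views} views, {stage.steps:,} steps")
```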
👥 Authors
Hanwen Liang (University of Toronto)
Junli Cao (University of California, Los Angeles)
Vidit Goel (Snap Inc.)
Guocheng Qian (Snap Inc.)
Sergei Korolev (Snap Inc.)
D. Terzopoulos (University of California, Los Angeles)
Konstantinos N. Plataniotis (University of Toronto)
Sergey Tulyakov (Director of Research, Snap Inc.)
Jian Ren (Snap Inc.)