RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

📅 2024-04-10
🏛️ arXiv.org
📈 Citations: 32
Influential: 1
🤖 AI Summary
This work addresses the challenging problem of generating high-fidelity forward-facing 3D scenes from a single text description, without video or multi-view supervision. The proposed end-to-end framework integrates text-to-image diffusion priors, a 3D Gaussian splatting initialization, cross-view 3D inpainting, and depth-guided diffusion, jointly optimized under multi-view geometric constraints. The method supports both text-driven and single-image-driven 3D synthesis, producing multi-object scenes with accurate geometry, rich texture, complex layouts, and controllable style. Experiments show substantial improvements in 3D consistency and visual quality over prior text-to-3D approaches despite using no multi-view supervision, establishing a new paradigm for text-driven forward-facing 3D generation.

📝 Abstract
We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
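The initialization step described above lifts a text-to-image sample into 3D using estimated depth and then computes the occlusion volume: the set of points hidden behind the visible surface, which the 3D inpainting stage must later fill. A minimal sketch of that occlusion test, assuming a per-pixel depth map and a set of candidate depth planes along each camera ray (`occlusion_volume` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def occlusion_volume(depth, z_planes):
    """Mark points behind the observed depth surface as occluded.

    depth: (H, W) per-pixel depth of the lifted image.
    z_planes: (K,) candidate depths sampled along each pixel ray.
    Returns a (K, H, W) boolean volume: True where the sample lies
    behind the visible surface, i.e. a region the inpainting model
    must fill when the camera moves.
    """
    # Broadcast compare each plane depth against the observed depth.
    return z_planes[:, None, None] > depth[None, :, :]

# Toy 2x2 depth map and two depth planes.
depth = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
z_planes = np.array([0.5, 2.5])
vol = occlusion_volume(depth, z_planes)
# The z=0.5 plane is in front of every pixel (nothing occluded there);
# the z=2.5 plane is behind the pixels with depth 1.0 and 2.0.
```

In the full method this volume is defined over the 3D Gaussian representation rather than a plane sweep, but the principle is the same: occluded regions are unconstrained by the initial image and become the target of the cross-view inpainting optimization.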
Problem

Research questions and friction points this paper is trying to address.

How to generate high-quality 3D scenes from text descriptions alone.
How to supervise 3D scene optimization using only 2D inpainting and depth diffusion priors.
How to recover correct geometric structure without video or multi-view training data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian Splatting optimized with diffusion model priors.
Leverages 2D inpainting for low-variance supervision.
Integrates depth diffusion for high-fidelity geometry.
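On the last point: a monocular depth diffusion prior is only defined up to an affine (scale and shift) transform, so supervising rendered depth against it typically requires aligning the two first. A minimal sketch of such a scale-and-shift-invariant depth loss, solved by least squares (this is an illustrative formulation in the spirit of the paper's depth supervision, not its exact loss):

```python
import numpy as np

def aligned_depth_loss(rendered, prior):
    """Affine-invariant depth loss between a rendered depth map and a
    monocular depth prior. Solves least squares for scale s and shift t
    mapping the prior onto the render, then penalizes the residual."""
    r = rendered.ravel()
    p = prior.ravel()
    # Design matrix [prior, 1] for the affine fit r ~= s * p + t.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    residual = r - (s * p + t)
    return float(np.mean(residual ** 2))

# A prior that is an affine transform of the render incurs ~zero loss,
# so the supervision only constrains relative depth structure.
render = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
prior = 0.5 * render + 1.0
```

This invariance is what lets a relative-depth prior from a diffusion model shape the geometry of the splats without fighting the scene's absolute scale.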