🤖 AI Summary
This work addresses the challenging problem of reconstructing large-scale, geometrically accurate, and texture-complete 3D scenes from a single input image. We propose the first training-free, self-evolving single-image-to-3D framework that moves beyond conventional object-centric paradigms to enable large-scale scene reconstruction. Our method combines geometric reasoning from 3D generative models with visual priors from video diffusion models in a three-stage cross-domain iterative optimization pipeline: (i) spatial-prior-guided coarse mesh initialization, (ii) vision-guided fine-grained 3D mesh generation, and (iii) spatially constrained novel-view synthesis with inter-view consistency regularization. By coordinating inference across multiple models and jointly optimizing in the 2D and 3D domains, our approach improves geometric stability, inter-view texture consistency, and occluded-region completion. The output is a high-fidelity, render-ready triangular mesh.
📝 Abstract
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is to combine the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages, Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation, EvoScene alternates between the 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
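The three-stage alternation described above can be illustrated with a minimal control-flow sketch. All names below (`SceneState`, `init_spatial_prior`, `refine_mesh`, `synthesize_views`) are hypothetical placeholders, not the authors' actual API; the stages are stubbed with string bookkeeping purely to show how the loop alternates between the 3D and 2D domains.

```python
# Hypothetical sketch of EvoScene's self-evolving loop. Stage bodies are
# placeholders; only the control flow mirrors the pipeline described above.
from dataclasses import dataclass, field

@dataclass
class SceneState:
    mesh: str                                   # stand-in for a triangular mesh
    views: list = field(default_factory=list)   # accumulated 2D views

def init_spatial_prior(image):
    """Stage (i): spatial-prior-guided coarse mesh initialization."""
    return SceneState(mesh=f"coarse_mesh({image})", views=[image])

def refine_mesh(state):
    """Stage (ii): vision-guided fine-grained mesh generation (3D domain)."""
    state.mesh = f"refined({state.mesh}, guided_by={len(state.views)}_views)"
    return state

def synthesize_views(state, n_new=2):
    """Stage (iii): spatially constrained novel-view synthesis (2D domain)."""
    start = len(state.views)
    state.views += [f"view_{start + i}" for i in range(n_new)]
    return state

def evoscene(image, iterations=3):
    state = init_spatial_prior(image)
    for _ in range(iterations):           # training-free, iterative evolution
        state = refine_mesh(state)        # improve structure in 3D
        state = synthesize_views(state)   # improve appearance coverage in 2D
    return state

result = evoscene("input.png")
print(result.mesh)
print(len(result.views))   # 1 input view + 3 iterations x 2 synthesized views
```

In this toy version each pass through the loop refines the mesh using all views gathered so far, then renders new views constrained by that mesh, which is the cross-domain feedback the paper attributes its gains in consistency and occluded-region completion to.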