🤖 AI Summary
Existing single-image 3D generation methods face two fundamental challenges: 3D diffusion models suffer from limited training data and weak geometric priors, while 2D diffusion approaches struggle to keep geometry consistent across views. To address these issues, we propose a novel paradigm that synergistically integrates Stable Video 3D (SV3D) priors with explicit 3D modeling. Our approach introduces a Gaussian rasterization decoder that distills implicit video representations into an explicit 3D Gaussian field. Furthermore, we formulate a geometry-aware joint optimization framework that simultaneously refines spatial structure and appearance attributes, unifying implicit inference with explicit 3D consistency. Extensive experiments demonstrate state-of-the-art performance in multi-view consistency, 3D structural fidelity, and cross-dataset generalization. Our method jointly produces high-fidelity multi-view images and precise, editable 3D Gaussian models, enabling both photorealistic rendering and downstream 3D editing tasks.
📝 Abstract
Image-based 3D generation has broad applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are constrained by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages the implicit 3D reasoning ability of 2D diffusion models while enforcing 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which relies only on implicit 2D representations for video generation, Gaussian splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints that correct cross-view inconsistencies. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.
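To make the decoder idea concrete: a Gaussian Splatting Decoder must map latent features to the explicit attributes of 3D Gaussians (means, scales, rotations, opacities, colors), each constrained to its valid range. The NumPy sketch below is a minimal illustration of such a parameter head; all dimensions, the single linear map, and every name in it are illustrative assumptions, not the paper's actual SV3D decoder architecture.

```python
import numpy as np

# Illustrative sizes -- NOT the paper's actual configuration.
LATENT_DIM = 16   # assumed latent channels per token
N_GAUSS = 8       # assumed number of Gaussians decoded
PARAM_DIM = 14    # 3 mean + 3 scale + 4 quaternion + 1 opacity + 3 RGB

rng = np.random.default_rng(0)
# A single linear layer standing in for the learned decoder weights.
W = rng.normal(0.0, 0.1, size=(LATENT_DIM, PARAM_DIM))
b = np.zeros(PARAM_DIM)

def decode_gaussians(latents: np.ndarray) -> dict:
    """Map latent features (N, LATENT_DIM) to explicit Gaussian attributes,
    applying an activation per attribute so each lands in its valid range."""
    raw = latents @ W + b                                       # (N, PARAM_DIM)
    mean = raw[:, 0:3]                                          # unconstrained 3D positions
    scale = np.exp(raw[:, 3:6])                                 # strictly positive scales
    quat = raw[:, 6:10]
    quat = quat / np.linalg.norm(quat, axis=1, keepdims=True)   # unit rotations
    opacity = 1.0 / (1.0 + np.exp(-raw[:, 10]))                 # sigmoid -> (0, 1)
    color = 1.0 / (1.0 + np.exp(-raw[:, 11:14]))                # RGB in (0, 1)
    return {"mean": mean, "scale": scale, "rot": quat,
            "opacity": opacity, "color": color}

latents = rng.normal(size=(N_GAUSS, LATENT_DIM))
gaussians = decode_gaussians(latents)
print(gaussians["mean"].shape, gaussians["scale"].shape)
```

Because every attribute is explicit, the decoded field can be rasterized from any camera and compared against the diffusion model's multi-view outputs, which is what lets geometric constraints propagate back and correct cross-view inconsistencies.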