GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

๐Ÿ“… 2025-03-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing single-image 3D generation methods face two fundamental challenges: 3D diffusion models suffer from limited training data and weak geometric priors, while 2D diffusion approaches struggle to ensure geometric consistency across views. To address these issues, we propose a novel paradigm that synergistically integrates priors from SV3D, a Stable Video Diffusion-based multi-view model, with explicit 3D modeling. Our approach introduces a Gaussian rasterization decoder that distills implicit video representations into an explicit 3D Gaussian field. Furthermore, we formulate a geometry-aware joint optimization framework that simultaneously refines spatial structure and appearance attributes, unifying implicit inference with explicit 3D consistency. Extensive experiments demonstrate state-of-the-art performance in multi-view consistency, 3D structural fidelity, and cross-dataset generalization. Our method jointly produces high-fidelity multi-view images and precise, editable 3D Gaussian models, enabling both photorealistic rendering and downstream 3D editing tasks.

๐Ÿ“ Abstract
Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which relies only on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints that correct view inconsistencies. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.
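The abstract's central idea is an explicit Gaussian field decoded from SV3D latent outputs. The sketch below illustrates the standard 3D Gaussian Splatting parameterization (position, scale, rotation, opacity, color) that such a decoder would predict. The 14-dimensional per-Gaussian feature layout and the `decode_latent_to_gaussians` helper are assumptions for illustration, not the paper's actual decoder architecture.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gaussian3D:
    """One explicit 3D Gaussian primitive, following the standard
    3DGS parameterization (the exact attribute set the paper's
    decoder predicts is an assumption here)."""
    position: List[float]  # (x, y, z) center in world space
    scale: List[float]     # per-axis standard deviations, positive
    rotation: List[float]  # unit quaternion (w, x, y, z)
    opacity: float         # alpha in [0, 1]
    color: List[float]     # RGB in [0, 1]

def decode_latent_to_gaussians(latent: List[List[float]]) -> List[Gaussian3D]:
    """Toy stand-in for a Gaussian Splatting decoder: maps each latent
    feature vector to one Gaussian's attributes. A real decoder is a
    learned network; this just slices a 14-dim vector per Gaussian."""
    gaussians = []
    for feat in latent:
        assert len(feat) == 14, "expected 3+3+4+1+3 = 14 values"
        gaussians.append(Gaussian3D(
            position=feat[0:3],
            scale=[abs(s) for s in feat[3:6]],          # enforce positivity
            rotation=feat[6:10],
            opacity=min(max(feat[10], 0.0), 1.0),       # clamp to [0, 1]
            color=[min(max(c, 0.0), 1.0) for c in feat[11:14]],
        ))
    return gaussians
```

Because every attribute lives in a shared 3D world frame, any camera can rasterize the same set of Gaussians, which is what makes the multi-view geometric constraints in the abstract enforceable.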
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in 3D diffusion models due to dataset scarcity and lack of pre-trained priors.
Ensures 3D consistency in single-image 3D generation using Gaussian-splatting-based geometric distillation.
Bridges the gap between 2D diffusion diversity and 3D structural coherence for robust geometric consistency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages 2D diffusion models for 3D reasoning
Uses Gaussian Splatting for 3D consistency
Transforms SV3D outputs into explicit 3D representations
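As a rough sketch of how the distillation in these bullets could close the loop, the toy function below compares views rendered from the explicit Gaussian field against the SV3D-generated frames using a plain per-pixel L1 photometric loss. The function name and the L1 choice are illustrative assumptions; the paper's geometry-aware joint optimization is a learned, more elaborate objective.

```python
from typing import List

def multiview_consistency_loss(rendered: List[List[float]],
                               generated: List[List[float]]) -> float:
    """Mean absolute per-pixel difference between views rendered from
    the explicit 3D Gaussian field and the corresponding SV3D frames.
    Views are flattened pixel lists; a toy stand-in for the paper's
    actual consistency objective."""
    total, count = 0.0, 0
    for r_view, g_view in zip(rendered, generated):
        for r_px, g_px in zip(r_view, g_view):
            total += abs(r_px - g_px)
            count += 1
    return total / count if count else 0.0
```

Minimizing such a loss over the Gaussian attributes pulls the explicit 3D representation toward the generated views, while the shared geometry corrects inconsistencies between them.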
๐Ÿ”Ž Similar Papers
No similar papers found.
Ye Tao
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Jiawei Zhang
SenseTime Research
Yahao Shi
Beihang University
Dongqing Zou
SenseTime Research, PBVR
Bin Zhou
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, PBVR