🤖 AI Summary
Problem: Gaussian Splatting (GS) loses much of its 3D reconstruction accuracy when the multi-view input is extremely sparse (e.g., only three images). The main cause is that the sparse point cloud obtained through Structure from Motion (SfM) contains too few 3D points to give the Gaussian primitives a good initialization.
Method: The paper proposes Sparse2DGS, which combines DUSt3R, a foundation model for stereo images, with COLMAP multi-view stereo (MVS) to generate accurate, dense 3D point clouds. These point clouds are then used to initialize the 2D Gaussians of 2DGS.
Contribution/Results: Experiments on the DTU dataset show that Sparse2DGS can accurately reconstruct the 3D shapes of objects from just three images, which reduces GS's dependence on a large number of input views.
📝 Abstract
Gaussian Splatting (GS) has gained attention as a fast and effective method for novel view synthesis. It has also been applied to 3D reconstruction from multi-view images, where it can be both fast and accurate. However, GS assumes a large number of input views, so reconstruction accuracy drops significantly when only a few images are available. One of the main reasons is that the sparse point cloud obtained through Structure from Motion (SfM) contains too few 3D points, which results in a poor initialization for optimizing the Gaussian primitives. We propose a new 3D reconstruction method, called Sparse2DGS, that enhances 2DGS so it can reconstruct objects from only three images. Sparse2DGS employs DUSt3R, a foundation model for stereo images, together with COLMAP MVS to generate highly accurate and dense 3D point clouds, which are then used to initialize the 2D Gaussians. Through experiments on the DTU dataset, we show that Sparse2DGS can accurately reconstruct the 3D shapes of objects using just three images.
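To make the initialization step more concrete, the following is a minimal sketch of how a dense point cloud might be turned into initial 2D Gaussian parameters. It is not the authors' implementation. It assumes the common Gaussian splatting practice of sizing each primitive from the mean squared distance to its nearest neighbors. The function name `init_gaussians_from_points`, the parameters `k` and `min_dist2`, the identity rotations, and the starting opacity of 0.1 are all hypothetical choices made for illustration. The dense points themselves would come from the DUSt3R and COLMAP MVS stage, which is not shown here.

```python
import numpy as np


def init_gaussians_from_points(points, k=3, min_dist2=1e-7):
    """Hypothetical helper: build initial 2D Gaussian parameters from 3D points.

    points: (N, 3) array of 3D positions, e.g. a dense MVS point cloud.
    k: number of nearest neighbors used to estimate the local point spacing.
    min_dist2: lower bound on the squared spacing, to avoid zero-size splats.
    """
    points = np.asarray(points, dtype=np.float64)
    n = points.shape[0]
    if points.ndim != 2 or points.shape[1] != 3:
        raise ValueError("points must have shape (N, 3)")
    if n <= k:
        raise ValueError("need more than k points")

    # Brute-force pairwise squared distances, O(N^2) memory.
    # Fine for a sketch; a real pipeline would use a spatial index.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist2, np.inf)  # exclude each point from its own neighbors

    # Mean squared distance to the k nearest neighbors of every point.
    knn_dist2 = np.partition(dist2, k - 1, axis=1)[:, :k]
    mean_dist2 = np.maximum(knn_dist2.mean(axis=1), min_dist2)

    # A 2D Gaussian (surfel) has two in-plane scales; start them isotropic.
    scale = np.sqrt(mean_dist2)
    scales = np.repeat(scale[:, None], 2, axis=1)

    # Assumption: identity quaternions (w, x, y, z) and a low starting opacity.
    rotations = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (n, 1))
    opacities = np.full((n, 1), 0.1)

    return {
        "positions": points,
        "scales": scales,
        "rotations": rotations,
        "opacities": opacities,
    }


# Usage: the eight corners of a unit cube stand in for a dense point cloud.
cube = np.array(
    [[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
)
gaussians = init_gaussians_from_points(cube)
print(gaussians["scales"].shape)
```

The sketch shows why point density matters: with a dense cloud, neighbor distances are small, so the initial Gaussians are small and already sit close to the surface. With a sparse SfM cloud, the same rule yields a few large, poorly placed primitives for the optimizer to start from.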