$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work proposes a feed-forward 3D reconstruction method built on a frozen, pretrained visual foundation model (VFM), with no per-scene fine-tuning. Existing Gaussian splatting approaches depend on accurate camera parameters and scene-specific optimization, while feed-forward methods that predict pixel-aligned Gaussians are prone to artifacts caused by non-uniform primitive distributions. To address both limitations, the method extracts multi-scale features from the VFM and uses a differentiable voxel module to regress Gaussian parameters directly from voxel features. To improve geometric fidelity, depth maps are supervised during training through stochastic solid volume rendering. The work is presented as the first to jointly model voxels and Gaussians within a VFM framework, and it reports state-of-the-art results on the DTU, Replica, TAT, and ScanNet benchmarks while keeping reconstruction quality high and training cost low.
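The depth supervision mentioned above can be illustrated with a minimal sketch. It uses standard front-to-back alpha compositing to obtain an expected depth along one ray, followed by an L1 loss against a reference depth. This is not the paper's stochastic solid volume rendering, whose details are not given here; the function names `composite_depth` and `depth_l1_loss` and all inputs are hypothetical.

```python
import numpy as np

def composite_depth(alphas, depths):
    """Expected depth along one ray via front-to-back alpha compositing.

    alphas: (N,) opacities in [0, 1], ordered near to far (illustrative inputs).
    depths: (N,) depths of the same samples, in the same order.
    Returns the composited depth and the per-sample blending weights.
    """
    alphas = np.asarray(alphas, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    # Transmittance reaching each sample: product of (1 - alpha) over nearer samples.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return float(np.sum(weights * depths)), weights

def depth_l1_loss(rendered, target):
    """Mean absolute error between rendered and reference depths."""
    rendered = np.asarray(rendered, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(rendered - target)))
```

In a training loop, a loss of this kind would be computed per pixel and backpropagated only into the trainable modules, which is consistent with the VFM staying frozen.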
📝 Abstract
Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. $\text{VG}^2$GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables $\text{VG}^2$GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. $\text{VG}^2$GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.
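To make the idea of regressing Gaussians from voxel features concrete, here is a rough sketch under stated assumptions. It pools per-point features into voxels by averaging, then applies a toy linear head that emits one Gaussian per voxel. The abstract does not specify the pooling rule, the head architecture, the parameter layout, or how primitives are split, so `voxelize_mean`, `regress_gaussians`, the 14-channel layout, and the single-Gaussian-per-voxel choice are all assumptions for illustration.

```python
import numpy as np

def voxelize_mean(points, feats, voxel_size):
    """Average the features of all 3D points that fall into the same voxel.

    points: (N, 3) point positions; feats: (N, C) per-point features.
    Returns voxel centers (V, 3) and pooled voxel features (V, C).
    """
    coords = np.floor(points / voxel_size).astype(np.int64)   # integer voxel indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                              # flatten across NumPy versions
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inverse, feats)                            # scatter-add features per voxel
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    centers = (uniq + 0.5) * voxel_size
    return centers, sums / counts

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regress_gaussians(centers, voxel_feats, weight, bias, voxel_size):
    """Toy linear head (assumed layout): 3 offset, 3 scale, 4 rotation, 1 opacity, 3 color."""
    raw = voxel_feats @ weight + bias                          # (V, 14)
    offset = np.tanh(raw[:, 0:3]) * 0.5 * voxel_size           # keeps the mean inside its voxel
    quat = raw[:, 6:10]
    quat = quat / (np.linalg.norm(quat, axis=1, keepdims=True) + 1e-8)
    return {
        "mean": centers + offset,
        "scale": np.exp(raw[:, 3:6]),                          # strictly positive scales
        "rotation": quat,                                      # unit quaternion
        "opacity": sigmoid(raw[:, 10:11]),
        "color": sigmoid(raw[:, 11:14]),
    }
```

Because each Gaussian mean is bounded to its own voxel, the primitives are spread over occupied space at a controlled density, which is one plausible reading of how a voxel representation counters the non-uniform primitives of pixel-aligned prediction.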
Problem

Research questions and friction points this paper is trying to address.

Gaussian splatting
3D reconstruction
novel view synthesis
camera parameters
per-scene optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian splatting
visual foundation model
voxel-based representation
differentiable rendering
3D reconstruction