G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

📅 2025-08-15

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

This work addresses the limited geometric accuracy and robustness of feed-forward single-image 3D scene reconstruction in real-world scenarios. We propose a lightweight multimodal fusion framework to enhance the performance of CUT3R-style models. Methodologically, we design dedicated encoders for geometric priors—including depth, camera intrinsics, and pose—and introduce a zero-initialized convolution mechanism to enable dynamic, plug-and-play fusion of prior features with RGB image tokens. Furthermore, we modify the Transformer architecture to support cross-modal guided reconstruction. Experiments demonstrate that our approach significantly outperforms existing feed-forward methods on multiple 3D reconstruction and multi-view benchmarks (e.g., ScanNet, Matterport3D), achieving superior geometric fidelity, generalization capability, and flexibility in modality composition. These results empirically validate the critical performance gain from explicitly incorporating geometric priors into single-image reconstruction.

Technology Category

Application Category

📝 Abstract

We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities.

Problem

Research questions and friction points this paper is trying to address.

Enhances 3D scene reconstruction using prior information

Integrates depth and camera data with RGB images

Improves performance across multi-view reconstruction tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates depth and camera priors

Uses dedicated encoders for each modality

Fuses features via zero convolution

🔎 Similar Papers

Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View