Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a practical limitation of existing RGB-thermal novel view synthesis methods: their reliance on precisely paired data or stereo calibration. It proposes a framework for joint 3D reconstruction from unpaired RGB and thermal images. The approach uses VGGT (Visual Geometry Grounded Transformer), a feed-forward 3D transformer, to estimate camera poses independently for each modality, aligns the two pose sets via cross-modal feature matching and Procrustes analysis, and trains a multi-modal 3D Gaussian Splatting model for joint rendering. The authors also introduce a benchmark that evaluates cross-modal consistency alongside per-modality synthesis quality. Experiments on diverse scenes show competitive thermal view synthesis while maintaining RGB fidelity.

📝 Abstract

Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
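The Procrustes alignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that cross-modal feature matching has already produced corresponding 3D points (one set in the RGB reconstruction frame, one in the thermal frame) and estimates the similarity transform (scale, rotation, translation) between them in the least-squares sense, following the standard Umeyama closed form. The function names and the choice of matched 3D points as input are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def similarity_procrustes(src, dst):
    """Estimate s, R, t minimizing sum_i ||s * R @ src[i] + t - dst[i]||^2.

    src, dst: (N, 3) arrays of corresponding 3D points (hypothetical inputs,
    e.g. points matched between the thermal and RGB reconstructions).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # 3x3 cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt

    # Scale is the ratio of aligned covariance to the source variance.
    var_src = (src_c ** 2).sum() / len(src)
    s = (S * np.diag(D)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def apply_similarity(points, s, R, t):
    """Map (N, 3) points from the source frame into the destination frame."""
    return s * np.asarray(points, dtype=float) @ R.T + t


if __name__ == "__main__":
    # Synthetic check: recover a known transform from noiseless matches.
    rng = np.random.default_rng(0)
    thermal_pts = rng.normal(size=(20, 3))
    angle = 0.4
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    rgb_pts = apply_similarity(thermal_pts, 1.8, R_true, np.array([0.5, -1.0, 2.0]))
    s, R, t = similarity_procrustes(thermal_pts, rgb_pts)
    print("scale:", s)
    print("max residual:", np.abs(apply_similarity(thermal_pts, s, R, t) - rgb_pts).max())
```

With the transform in hand, thermal camera centers can be mapped into the RGB frame with `apply_similarity`, and camera orientations rotated by `R`. Real matches are noisy, so a robust wrapper such as RANSAC around this estimator would likely be needed in practice; that is an assumption here as well.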

Problem

Research questions and friction points this paper is trying to address.

unpaired RGB-thermal

multi-modal novel view synthesis

cross-modal consistency

3D scene reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

unpaired multi-modal NVS

Visual Geometric Transformer

3D Gaussian Splatting

cross-modal pose alignment