Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of intrinsic 3D spatial awareness in multimodal large language models, which often results in geometric and spatial inconsistencies across video frames. To overcome this limitation, the authors propose GeoVR, a novel framework that, for the first time, enables joint optimization of internal model representations using only 2D video sequences, without requiring large-scale 3D training data. GeoVR applies four geometric constraints (inter-frame camera pose estimation, dense depth regression, metric scale prediction, and multi-scale 3D feature distillation) to transfer geometric knowledge from pretrained 3D foundation models into the semantic latent space. The authors report that this improves the model's spatial reasoning and achieves state-of-the-art performance on spatial reasoning benchmarks, establishing a new paradigm for endowing foundation models with spatial intelligence.

📝 Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
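The abstract names four training targets but does not give their loss forms or weights. As a rough illustration only, the multi-objective idea can be sketched as a weighted sum of four terms computed against teacher outputs. Everything specific here is an assumption and not taken from the paper: the function names, the dictionary layout, the choice of squared L2 for pose, L1 for depth, squared log-ratio for scale, cosine distance for features, and the equal default weights.

```python
import math

def l1(pred, target):
    # Mean absolute error over two equal-length lists of floats.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def squared_l2(pred, target):
    # Mean squared error over two equal-length lists of floats.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_distance(a, b, eps=1e-12):
    # 1 - cosine similarity; eps guards against zero-norm vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + eps)

def geometric_loss(pred, teacher, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of four geometric terms.

    pred / teacher are dicts with keys (all assumed for illustration):
      "pose":     flat list of relative camera pose parameters
      "depth":    flat list of per-pixel depths
      "scale":    positive float, metric scale factor
      "features": list of feature vectors, one per scale
    """
    w_pose, w_depth, w_scale, w_feat = weights
    terms = {
        # (1) inter-frame camera pose: squared L2 (assumed form)
        "pose": squared_l2(pred["pose"], teacher["pose"]),
        # (2) dense depth: L1 (assumed form)
        "depth": l1(pred["depth"], teacher["depth"]),
        # (3) metric scale: squared log-ratio (assumed form)
        "scale": (math.log(pred["scale"]) - math.log(teacher["scale"])) ** 2,
        # (4) multi-scale feature distillation: mean cosine distance (assumed form)
        "features": sum(
            cosine_distance(s, t)
            for s, t in zip(pred["features"], teacher["features"])
        ) / len(pred["features"]),
    }
    total = (w_pose * terms["pose"] + w_depth * terms["depth"]
             + w_scale * terms["scale"] + w_feat * terms["features"])
    return total, terms

# Toy usage with made-up numbers: only the depth prediction differs from the teacher.
teacher = {"pose": [0.1, 0.0, 0.2], "depth": [1.0, 2.0],
           "scale": 1.5, "features": [[1.0, 0.0], [0.0, 1.0]]}
pred = dict(teacher, depth=[1.5, 2.5])
total, terms = geometric_loss(pred, teacher)
```

Returning the per-term values alongside the total makes it easy to see which target dominates; in practice the relative weights would need tuning, and the paper may use different loss forms entirely.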

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

3D awareness

spatial consistency

geometric representations

video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric representation learning

spatial intelligence

multimodal large language models