🤖 AI Summary
Current vision-language models lack a hierarchical foundation for spatial reasoning, from perception through understanding, which leaves their performance brittle. To address this, the authors introduce SpatialLadder-26k, a multimodal dataset of 26,610 samples covering object localization and single-image, multi-view, and video-based spatial reasoning, together with a three-stage progressive training framework: (1) spatial perception via object localization, (2) spatial understanding via multi-dimensional spatial tasks, and (3) complex reasoning via reinforcement learning with verifiable rewards. The resulting 3B-parameter model, SpatialLadder, achieves state-of-the-art results on spatial reasoning benchmarks, with a 23.4% average improvement over its base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. It also generalizes well, gaining 7.2% on out-of-domain benchmarks. SpatialLadder thus systematically bridges low-level perception and high-level spatial reasoning through a unified dataset and training design.
📝 Abstract
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with a 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with a 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
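The three-stage framework described in the abstract can be pictured as a simple curriculum driver that runs supervised stages before the reinforcement-learning stage. The sketch below is illustrative only: the stage names, task labels, and `train_step` hook are assumptions for exposition, not the paper's actual training API, and the real pipeline fine-tunes a 3B VLM rather than calling a stub.

```python
# Illustrative sketch of a three-stage progressive curriculum
# (perception -> understanding -> reasoning). All identifiers here
# are hypothetical; they mirror the stage descriptions in the
# abstract, not a released implementation.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str       # human-readable stage label
    objective: str  # "sft" (supervised fine-tuning) or "rlvr" (RL with verifiable rewards)
    tasks: list     # task families drawn from SpatialLadder-26k


def build_curriculum():
    """Return the three stages in the order the abstract describes."""
    return [
        Stage("spatial-perception", "sft", ["object-localization"]),
        Stage("spatial-understanding", "sft",
              ["single-image", "multi-view", "video"]),
        Stage("spatial-reasoning", "rlvr",
              ["single-image", "multi-view", "video"]),
    ]


def run(curriculum, train_step):
    """Run the stages strictly in sequence.

    `train_step(stage)` stands in for the actual fine-tuning or RL
    loop for that stage; here we only record the execution order.
    """
    log = []
    for stage in curriculum:
        train_step(stage)
        log.append((stage.name, stage.objective))
    return log
```

The key design point captured here is ordering: later stages assume the capabilities trained in earlier ones, so the driver never interleaves or skips stages.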