Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

📅 2026-06-10
🤖 AI Summary
This work addresses the data inefficiency of training multimodal large language models (MLLMs) for 3D spatial reasoning, where statically curated training data does not track the model's evolving capabilities. The authors propose Ouroboros-Spatial, a self-evolving training framework in which a frozen question proposer and a learnable solver alternate. Using the solver's per-sample prediction confidence as a difficulty signal, the proposer generates 3D spatial question-answer pairs, each with executable code for deriving the ground truth, that match the solver's current proficiency. This closed-loop co-evolution improves data efficiency: across six spatial reasoning benchmarks, Qwen3-VL-4B and Qwen3-VL-8B improve substantially while training on an order of magnitude fewer examples than recent large-scale curated datasets. On VSI-Bench, the 4B and 8B models gain 9.9 and 6.8 points, respectively, and outperform a wide range of strong open-source and proprietary baselines.
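As a rough illustration of a question-answer pair that carries executable code for its ground truth, the sketch below computes an answer directly from scene metadata. The metadata schema (object names mapped to 3D centroids in meters), the question wording, and the helper name `closest_object` are all assumptions made for this example; the paper's actual metadata format and generated code are not specified here.

```python
import math

# Hypothetical scene metadata: object name -> 3D centroid in meters.
scene = {
    "sofa": (0.0, 0.0, 0.0),
    "lamp": (3.0, 4.0, 0.0),
    "table": (1.0, 1.0, 0.0),
}

def closest_object(scene, anchor):
    """Return the object nearest to `anchor` and its Euclidean distance."""
    origin = scene[anchor]
    # Distance from the anchor to every other object in the scene.
    distances = {
        name: math.dist(origin, centroid)
        for name, centroid in scene.items()
        if name != anchor
    }
    nearest = min(distances, key=distances.get)
    return nearest, distances[nearest]

# The question text and the code that answers it travel together,
# so the answer can be recomputed and checked rather than trusted.
question = "Which object is closest to the sofa?"
answer, distance = closest_object(scene, "sofa")
```

Because the answer is computed from the metadata rather than written by hand, a pair whose code fails to run or returns nothing usable could simply be rejected, which is one plausible reading of how such code yields reliable ground truth.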
📝 Abstract
Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
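The closed loop described in the abstract can be sketched as a toy simulation. This is not the authors' implementation: the solver is reduced to a single scalar `skill`, each question to a scalar difficulty, confidence to a logistic function of their difference, and the feedback rule (shift the proposer's target difficulty by how far mean confidence is from 0.5) is an assumption chosen only to make the loop concrete. All function names are hypothetical.

```python
import math
import random

def confidence(skill, difficulty):
    """Toy stand-in for solver confidence: high when skill exceeds difficulty."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))

def propose(target, n, rng):
    """Toy frozen proposer: sample question difficulties around a target level."""
    return [rng.gauss(target, 0.5) for _ in range(n)]

def fine_tune(skill, difficulties, lr=0.5):
    """Toy solver update: questions near the solver's level teach the most."""
    confs = [confidence(skill, d) for d in difficulties]
    # c * (1 - c) peaks at c = 0.5, so trivial or hopeless questions add little.
    gain = sum(c * (1.0 - c) for c in confs)
    return skill + lr * gain / len(difficulties)

def run_loop(iterations=5, n=64, seed=0):
    """Alternate proposing, fine-tuning, and confidence feedback."""
    rng = random.Random(seed)
    skill, target = 0.0, 0.0
    history = []
    for _ in range(iterations):
        batch = propose(target, n, rng)        # proposer generates questions
        skill = fine_tune(skill, batch)        # solver trains on the batch
        mean_conf = sum(confidence(skill, d) for d in batch) / n
        # Assumed feedback rule: a confident solver gets harder questions next.
        target += mean_conf - 0.5
        history.append((skill, target, mean_conf))
    return history

history = run_loop()
```

In this sketch the target difficulty tends to follow the solver's skill, which mirrors the stated goal of keeping the training distribution matched to model ability; the actual system acts on generated questions and a fine-tuned MLLM rather than on scalars.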
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
multimodal large language models
data efficiency
training data curation
model capability alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving training
closed-loop data-model co-evolution
spatial reasoning
difficulty-aware sampling
multimodal large language models