SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

📅 2025-10-28

🤖 AI Summary

Problem: Vision-language models (VLMs) perform poorly on complex spatial reasoning tasks, largely because high-quality, multi-step reasoning data is scarce. Method: We propose a scalable data construction pipeline in which a large teacher model generates multi-hop, multi-tool reasoning traces, and an automated Verifier checks the fidelity of each reasoning step, improving trace quality without the cost of manual annotation. The resulting structured dataset is intended for supervised fine-tuning and sample-efficient offline reinforcement learning of smaller VLMs. Results: On the CLEVR-Humans benchmark, the verifier-guided process raises the average trace quality score by 17% and reduces quality variance by over 40%. Contribution: a Verifier-driven framework for distilling high-fidelity reasoning traces, aimed at strengthening spatial reasoning in lightweight VLMs.

📝 Abstract

While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.
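The verifier-guided process described in the abstract can be pictured as a loop in which a teacher proposes one tool call at a time and a verifier scores it before it is added to the trace. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how a rejected step is handled, so the retry-then-discard policy, the `threshold` and `max_retries` parameters, the `Step` fields, and the toy teacher and verifier are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    tool: str           # hypothetical tool name, e.g. "detect_objects"
    argument: str       # input passed to the tool
    observation: str    # what the tool returned
    score: float = 0.0  # verifier score, assumed to lie in [0, 1]


# Assumed interfaces: the teacher returns None when it considers the trace complete.
Proposer = Callable[[str, List[Step], int], Optional[Step]]
Verifier = Callable[[str, List[Step], Step], float]


def generate_trace(
    question: str,
    propose_step: Proposer,
    verify_step: Verifier,
    threshold: float = 0.5,
    max_retries: int = 3,
    max_steps: int = 8,
) -> Optional[List[Step]]:
    """Build a trace step by step, keeping only steps the verifier accepts."""
    trace: List[Step] = []
    while len(trace) < max_steps:
        accepted = None
        for attempt in range(max_retries):
            candidate = propose_step(question, trace, attempt)
            if candidate is None:
                return trace  # teacher signals that the trace is finished
            candidate.score = verify_step(question, trace, candidate)
            if candidate.score >= threshold:
                accepted = candidate
                break
        if accepted is None:
            return None  # assumed policy: discard traces with an unfixable step
        trace.append(accepted)
    return trace


# Toy stand-ins so the sketch runs without any model or image.
def toy_teacher(question: str, trace: List[Step], attempt: int) -> Optional[Step]:
    plan = [("detect_objects", "image"), ("estimate_depth", "red cube"), ("answer", "left")]
    if len(trace) >= len(plan):
        return None
    tool, arg = plan[len(trace)]
    if len(trace) == 1 and attempt == 0:
        # Deliberately flawed first attempt at the second step.
        return Step(tool, "", "no target given")
    return Step(tool, arg, f"ran {tool} on {arg}")


def toy_verifier(question: str, trace: List[Step], step: Step) -> float:
    # Stand-in check: a step with an empty argument is rejected.
    return 1.0 if step.argument else 0.0


trace = generate_trace("Is the red cube left of the sphere?", toy_teacher, toy_verifier)
rejected = generate_trace("any question", toy_teacher, lambda q, t, s: 0.0)
```

In this toy run the flawed second step is rejected once and then replaced by a corrected proposal, while a verifier that rejects everything causes the whole trace to be dropped. A real verifier would presumably compare each step against the image and the preceding steps rather than inspect a single field.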

Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of high-quality spatial reasoning data

Automating verification to ensure step-by-step reasoning fidelity

Enabling efficient distillation of complex reasoning into smaller models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated verifier ensures reasoning step fidelity

Distills a large teacher model's reasoning into multi-hop, multi-tool traces

Generates structured step-by-step tool use examples
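To make the idea of structured step-by-step tool-use examples concrete, the sketch below turns one verified trace into per-step (prompt, target) pairs of the kind commonly used for supervised fine-tuning. The record layout, the prompt template, and the function name `trace_to_sft_examples` are assumptions made for illustration; the summary does not specify the dataset's actual schema.

```python
import json
from typing import Dict, List


def trace_to_sft_examples(question: str, steps: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Emit one training pair per step: context so far -> next tool call."""
    examples: List[Dict[str, str]] = []
    history: List[str] = []
    for step in steps:
        # The prompt holds the question plus all earlier actions and observations.
        prompt = "\n".join([f"Question: {question}", *history, "Next action:"])
        # The target is the tool call the teacher made at this point (hypothetical format).
        target = json.dumps({"tool": step["tool"], "argument": step["argument"]})
        examples.append({"prompt": prompt, "target": target})
        history.append(f"Action: {target}")
        history.append(f"Observation: {step['observation']}")
    return examples


# Hypothetical two-step trace with made-up tool names.
demo_steps = [
    {"tool": "detect_objects", "argument": "image", "observation": "cube, sphere"},
    {"tool": "answer", "argument": "left", "observation": "done"},
]
demo_examples = trace_to_sft_examples("Is the cube left of the sphere?", demo_steps)
```

Splitting a trace per step means each example teaches a single decision given the preceding tool outputs; the same records could also be read as (state, action) pairs for offline reinforcement learning, although how the authors structure that setting is not stated here.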
