SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

📅 2025-10-28
🤖 AI Summary
Vision-language models (VLMs) perform poorly on complex spatial reasoning tasks, primarily because high-quality, multi-step reasoning data is scarce. Method: We propose a scalable data construction paradigm in which a large teacher model generates multi-hop, multi-tool reasoning trajectories; an automated Verifier validates logical consistency and spatial fidelity at each reasoning step, improving trajectory quality while avoiding the cost of manual annotation. The resulting structured dataset is designed to support supervised fine-tuning and sample-efficient offline reinforcement learning for small-scale VLMs. Results: On the CLEVR-Humans benchmark, the verifier-guided process increases the average trajectory quality score by 17% and reduces quality variance by over 40%. Our core contribution is a Verifier-driven, high-fidelity trajectory distillation framework that offers a pathway to stronger spatial reasoning in lightweight VLMs.

📝 Abstract
While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.
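
The verifier-guided generation described in the abstract can be pictured with a minimal Python sketch. The abstract does not specify how the Verifier scores a step, what threshold it applies, or how rejected steps are handled, so everything here is an assumption made for illustration: the `Step` and `Trace` types, the `generate_trace` function, the `threshold` and `max_retries` parameters, and the toy teacher and verifier stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    tool: str            # hypothetical tool name, e.g. "detect"
    tool_input: str      # argument passed to the tool
    observation: str     # what the tool returned
    score: float = 0.0   # verifier score, assumed to lie in [0, 1]


@dataclass
class Trace:
    question: str
    steps: List[Step] = field(default_factory=list)


def generate_trace(
    question: str,
    propose_step: Callable[[Trace], Optional[Step]],  # teacher model stand-in
    verify_step: Callable[[Trace, Step], float],      # verifier stand-in
    threshold: float = 0.5,   # assumed acceptance threshold
    max_steps: int = 8,
    max_retries: int = 3,
) -> Trace:
    """Build a trace one step at a time, keeping only verified steps."""
    trace = Trace(question)
    for _ in range(max_steps):
        accepted = None
        for _ in range(max_retries):
            candidate = propose_step(trace)
            if candidate is None:      # teacher signals that it is done
                return trace
            candidate.score = verify_step(trace, candidate)
            if candidate.score >= threshold:
                accepted = candidate
                break
        if accepted is None:           # every retry failed verification
            break
        trace.steps.append(accepted)
    return trace


# Toy stand-ins: a scripted "teacher" and a rule-based "verifier".
_script = iter([
    Step("detect", "cube", "1 cube found"),
    Step("guess", "", "unsupported claim"),   # the verifier rejects this step
    Step("detect", "sphere", "2 spheres found"),
])


def toy_teacher(trace: Trace) -> Optional[Step]:
    return next(_script, None)


def toy_verifier(trace: Trace, step: Step) -> float:
    return 0.0 if step.tool == "guess" else 1.0


trace = generate_trace("How many spheres are left of the cube?",
                       toy_teacher, toy_verifier)
print([s.tool for s in trace.steps])
```

In this sketch a rejected step is simply re-proposed, and only steps that pass verification enter the stored trace. A real pipeline might instead pass the verifier's feedback back to the teacher or discard the whole trace.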
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of high-quality spatial reasoning data
Automating verification to ensure step-by-step reasoning fidelity
Enabling efficient distillation of complex reasoning into smaller models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated verifier ensures reasoning step fidelity
Distills teacher model into multi-tool reasoning traces
Generates structured step-by-step tool use examples
Gio Huh
Computing + Mathematical Sciences, California Institute of Technology
Dhruv Sheth
Computing + Mathematical Sciences, California Institute of Technology
Rayhan Zirvi
Undergraduate Student, Caltech
Machine Learning, Diffusion Models, Generative Models
Frank Xiao
Caltech
Machine Learning