🤖 AI Summary
Large language models (LLMs) lack explicit spatial reasoning capabilities in 3D Cartesian space, which hinders precise object manipulation at the [x, y, z] coordinate level. Method: We propose a semantics-driven spatial tokenization mechanism, the first of its kind, that encodes elevation as interpretable semantic units, and we introduce a symbolic compositional training paradigm that integrates geometric priors and logical rules to synthesize high-quality spatial reasoning data. We then perform spatially aware fine-tuning of LLMs to enable end-to-end alignment between 3D coordinate instructions and robotic actions. Contribution/Results: Our method achieves 66.67% accuracy on object manipulation subtasks, substantially outperforming GPT-4o (37.5%) and Claude 3.5 Sonnet (29.17%). It establishes a novel, interpretable, and generalizable spatial reasoning paradigm for grounding LLMs in embodied intelligence.
📝 Abstract
This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian space navigation. AlphaSpace employs a semantics-based tokenization strategy that encodes height information through specialized semantic tokens, and it trains on synthetic reasoning data that is primarily symbolic. This approach enables LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates. Experimental results demonstrate that AlphaSpace significantly outperforms existing models on manipulation subtasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet.
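To make the core idea concrete, here is a minimal sketch of what encoding an [x, y, z] coordinate as discrete semantic tokens could look like. The token names (`<x_i>`, `<y_j>`, `<h_k>`) and the grid size are hypothetical placeholders, not AlphaSpace's actual vocabulary; the paper's exact token format is not given in this summary.

```python
# Illustrative sketch only: the token families <x_i>, <y_j>, <h_k> and
# the 100-cell grid are assumptions, not AlphaSpace's real vocabulary.

def encode_position(x: int, y: int, z: int, grid: int = 100) -> list[str]:
    """Map an [x, y, z] coordinate onto three discrete semantic tokens."""
    for name, v in (("x", x), ("y", y), ("z", z)):
        if not 0 <= v < grid:
            raise ValueError(f"{name}={v} is outside [0, {grid})")
    # Height (z) gets its own token family so a model can reason about
    # elevation separately from the ground-plane (x, y) position.
    return [f"<x_{x}>", f"<y_{y}>", f"<h_{z}>"]

def decode_position(tokens: list[str]) -> tuple[int, int, int]:
    """Invert encode_position back to integer coordinates."""
    vals = {t[1]: int(t.split("_")[1].rstrip(">")) for t in tokens}
    return (vals["x"], vals["y"], vals["h"])
```

The point of such a scheme is that each coordinate becomes a single interpretable unit in the model's vocabulary, so fine-tuning can align token-level predictions directly with 3D placement targets rather than forcing the model to compose multi-digit numbers.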