🤖 AI Summary
This work addresses a fundamental trade-off between interpretability and open-vocabulary understanding in 3D spatial reasoning: neuro-symbolic systems are confined to closed concept sets and simple programs, while end-to-end 3D multimodal large language models reason as black boxes without explicit spatial verification. To bridge this gap, we propose APEIRIA, which for the first time distills the reasoning patterns of neuro-symbolic programs into a 3D multimodal large language model through a three-stage curriculum, enabling transparent yet flexible spatial reasoning via natural-language chain-of-thought. The curriculum combines 3D perception alignment, chain-of-thought supervised fine-tuning (CoT-SFT) guided by symbolic program traces, and chain-of-thought reinforcement learning (CoT-RL). It preserves the interpretability and modularity of symbolic systems while supporting open-vocabulary concepts and deeply nested instructions. Experiments on grounding, question answering, and captioning show that APEIRIA outperforms prior neuro-symbolic methods and matches state-of-the-art 3D multimodal large language models.
📝 Abstract
Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs, whereas end-to-end 3D multimodal LLMs (3D MLLMs) can handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM that bridges the two paradigms by distilling symbolic reasoning patterns into MLLMs through natural-language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: (a) 3D perception alignment grounds object visual-geometric features to the LLM, (b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and (c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying the systematic reasoning of symbolic methods with the flexibility of MLLMs. Code is available at https://github.com/oceanflowlab/APEIRIA.
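
To make the idea of turning a symbolic program trace into a natural-language chain-of-thought more concrete, the toy sketch below runs a two-step program (filter by label, then pick the candidate nearest to an anchor) over a hand-made scene and records each step as a sentence. It is only an illustration under stated assumptions: the names `Obj`, `filter_label`, `nearest_to`, and `ground`, the scene, and the trace wording are all hypothetical and are not taken from the APEIRIA code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical scene object: an id, a class label, and a 3D center."""
    oid: int
    label: str
    center: tuple  # (x, y, z) in meters


def filter_label(objs, label, trace):
    """Symbolic 'filter' step: keep objects with the given label and log it."""
    kept = [o for o in objs if o.label == label]
    trace.append(f"Find objects labeled '{label}': ids {[o.oid for o in kept]}.")
    return kept


def nearest_to(candidates, anchors, trace):
    """Symbolic 'relate' step: pick the candidate closest to any anchor.

    Assumes both lists are non-empty; a real system would handle failures.
    """
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5

    best = min(candidates, key=lambda c: min(dist(c, a) for a in anchors))
    trace.append(f"Among the candidates, object {best.oid} is closest to the anchor.")
    return best


def ground(scene, target, anchor):
    """Run the toy program and return (answer id, natural-language trace)."""
    trace = []
    targets = filter_label(scene, target, trace)
    anchors = filter_label(scene, anchor, trace)
    best = nearest_to(targets, anchors, trace)
    trace.append(f"Answer: object {best.oid}.")
    return best.oid, trace


# Toy query: "the chair nearest to the table"
scene = [
    Obj(0, "chair", (0.0, 0.0, 0.0)),
    Obj(1, "chair", (3.0, 0.0, 0.0)),
    Obj(2, "table", (2.5, 0.5, 0.0)),
]
answer, cot = ground(scene, "chair", "table")
for step in cot:
    print(step)
```

In this sketch each sentence in `cot` corresponds to one program step, which is the kind of stepwise, verifiable trace the abstract describes as supervision for CoT-SFT; how the actual traces are generated and phrased is not specified here.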