🤖 AI Summary
Existing 3D agents often perform poorly because they misuse tools or apply the same tool preferences regardless of scene. This work proposes Skill-3D, a framework that learns self-evolving, scene-aware skills. It records tool-use trajectories in a Scene Memory, aggregates those from similar scenes, and distills the successful ones into reusable skills, attaching failed ones as lessons. When a similar scene recurs, the matching skill is injected to guide the agent, and the resulting trajectories refine the skill in turn, so the skill library and the memory co-evolve. On VSI-Bench, Skill-3D raises tool utilization from 39% to 78%, and post-training on skill-guided trajectories improves Qwen3-VL-8B by 43%; on MMSI-Bench, it improves Gemini-3-Flash by 67%.
📝 Abstract
This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences in 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We find that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
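The record, distill, and inject loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names (`Trajectory`, `Skill`, `SceneMemory`), the tool names in the usage example, and the distillation rule (keep the most frequent successful tool sequence) are all assumptions made for illustration. The sketch also treats "similar scenes" as scenes sharing an identical label, whereas the paper aggregates trajectories from similar scenes by a method the abstract does not specify.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene: str      # hypothetical scene label
    tools: tuple    # ordered tool calls made by the agent
    success: bool   # whether the final answer was correct


@dataclass
class Skill:
    scene: str
    tool_plan: tuple = ()                         # distilled from successes
    lessons: list = field(default_factory=list)   # failed sequences to avoid


class SceneMemory:
    """Toy scene memory: stores trajectories and re-distills skills."""

    def __init__(self):
        self._by_scene = defaultdict(list)
        self.skills = {}

    def record(self, traj):
        # Store the trajectory, then refresh the skill for that scene,
        # so memory and skill library evolve together.
        self._by_scene[traj.scene].append(traj)
        self._distill(traj.scene)

    def _distill(self, scene):
        trajs = self._by_scene[scene]
        wins = Counter(t.tools for t in trajs if t.success)
        if not wins:
            return  # no skill until at least one success exists
        # Assumed rule: the most frequent successful sequence is the plan.
        plan = wins.most_common(1)[0][0]
        lessons = sorted({t.tools for t in trajs if not t.success})
        self.skills[scene] = Skill(scene, plan, lessons)

    def inject(self, scene, prompt):
        # Prepend the skill as guidance when a known scene recurs.
        skill = self.skills.get(scene)
        if skill is None:
            return prompt
        hint = "Suggested tools: " + " -> ".join(skill.tool_plan)
        if skill.lessons:
            avoid = "; ".join(" -> ".join(seq) for seq in skill.lessons)
            hint += "\nAvoid: " + avoid
        return hint + "\n" + prompt


# Usage with made-up scene and tool names.
memory = SceneMemory()
memory.record(Trajectory("room", ("depth", "measure"), True))
memory.record(Trajectory("room", ("caption",), False))
print(memory.inject("room", "How wide is the table?"))
```

In this sketch every new trajectory triggers re-distillation, which mirrors the co-evolution loop in the simplest possible way; a real system would likely batch updates and use a learned or embedding-based notion of scene similarity.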