AI Summary
To address the challenges of long-horizon planning, high uncertainty, and weak cross-location coordination in generalized multi-object collection tasks for domestic robots operating over large-scale scene graphs, this paper proposes Inter-LLM, an interleaved LLM and motion-planning framework. It uses a multimodal action-cost similarity function that reflects the mission history and looks ahead to the future when optimizing plans, so that task-level semantic reasoning and motion-level path optimization are performed jointly. In simulation, the method improves overall mission performance by 30% over recent work, measured by human-command fulfillment, mission success rate, and mission cost. By combining high-level reasoning with low-level motion planning, Inter-LLM targets open-set object manipulation and efficient navigation in large environments.
Abstract
Household robots have been a longstanding research topic, but they still lack human-like intelligence, particularly in manipulating open-set objects and navigating large environments efficiently and accurately. To push this boundary, we consider a generalized multi-object collection problem in large scene graphs, where the robot needs to pick up and place multiple objects across multiple locations during a long mission consisting of multiple human commands. This problem is extremely challenging because it requires long-horizon planning in a vast action-state space under high uncertainty. To this end, we propose Inter-LLM, a novel algorithm that interleaves LLM reasoning with motion planning. By designing a multimodal action cost similarity function, our algorithm can both reflect the history and look into the future to optimize plans, striking a good balance between quality and efficiency. Simulation experiments demonstrate that, compared with the latest works, our algorithm improves the overall mission performance by 30% in terms of fulfilling human commands, maximizing mission success rates, and minimizing mission costs.