🤖 AI Summary
Language-guided long-horizon manipulation of highly deformable objects remains challenging due to high degrees of freedom, complex dynamics, and difficulties in vision–language alignment.
Method: This work introduces the first end-to-end language-driven framework unifying planning, perception, and execution, using multi-step cloth folding as a canonical task. It (1) leverages large language models (LLMs) to decompose natural-language instructions into executable action primitives; (2) employs a SigLIP2-based vision–language model (VLM) for fine-grained, language-conditioned cloth state localization; and (3) integrates a bidirectional cross-attention fusion mechanism with weight-decomposed low-rank adaptation (DoRA) fine-tuning to bridge the semantic–physical gap between language understanding and robotic execution.
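To make the fusion step concrete, the sketch below illustrates the general idea of bidirectional cross-attention between vision and text tokens: each modality queries the other, and the attended features are added back residually. This is a minimal NumPy illustration of the generic mechanism, not the paper's implementation; the token counts, dimensions, and the residual fusion rule are assumptions for exposition (the actual model is SigLIP2-based and learned).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention: `queries` attend to `keys_values`."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Nq, Nkv) similarity
    return softmax(scores, axis=-1) @ keys_values   # (Nq, d) attended features

def bidirectional_fusion(vision_tokens, text_tokens):
    """Fuse in both directions, then add residually to each stream."""
    v2t = cross_attend(vision_tokens, text_tokens)  # vision queries text
    t2v = cross_attend(text_tokens, vision_tokens)  # text queries vision
    return vision_tokens + v2t, text_tokens + t2v

rng = np.random.default_rng(0)
vision = rng.standard_normal((16, 64))  # e.g. 16 image-patch tokens, dim 64
text = rng.standard_normal((8, 64))     # e.g. 8 instruction tokens, dim 64
fused_v, fused_t = bidirectional_fusion(vision, text)
print(fused_v.shape, fused_t.shape)  # (16, 64) (8, 64)
```

In a trained model the queries, keys, and values would pass through learned projections (and multiple heads); here they are used raw purely to show the information flow in both directions.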
Results: In simulation, the method achieves 2.23× and 1.87× improvements in success rates on seen and unseen instructions, respectively, and a 33.3% gain on unseen tasks. On real robots, it successfully performs multi-step folding across diverse cloth materials and initial configurations, demonstrating strong robustness and cross-scenario generalization.
📝 Abstract
Language-guided long-horizon manipulation of deformable objects presents significant challenges due to high degrees of freedom, complex dynamics, and the need for accurate vision-language grounding. In this work, we focus on multi-step cloth folding, a representative deformable-object manipulation task that requires both structured long-horizon planning and fine-grained visual perception. To this end, we propose a unified framework that integrates a Large Language Model (LLM)-based planner, a Vision-Language Model (VLM)-based perception system, and a task execution module. Specifically, the LLM-based planner decomposes high-level language instructions into low-level action primitives, bridging the semantic-execution gap, aligning perception with action, and enhancing generalization. The VLM-based perception module employs a SigLIP2-driven architecture with a bidirectional cross-attention fusion mechanism and weight-decomposed low-rank adaptation (DoRA) fine-tuning to achieve language-conditioned fine-grained visual grounding. Experiments in both simulation and real-world settings demonstrate the method's effectiveness. In simulation, it outperforms state-of-the-art baselines by 2.23×, 1.87×, and 33.3% on seen instructions, unseen instructions, and unseen tasks, respectively. On a real robot, it robustly executes multi-step folding sequences from language instructions across diverse cloth materials and configurations, demonstrating strong generalization in practical scenarios. Project page: https://language-guided.netlify.app/
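For readers unfamiliar with DoRA, the following sketch shows the core weight decomposition it uses for fine-tuning: the pretrained weight is split into a magnitude vector and a direction, a low-rank update is merged into the direction, the result is column-normalized, and the magnitude rescales it. This is a generic NumPy illustration of the DoRA formulation, not the paper's training code; the dimensions and initializations are assumptions chosen for the example.

```python
import numpy as np

def dora_weight(W0, A, B, m):
    """Weight-decomposed low-rank adaptation (DoRA):
    merge the low-rank update B @ A into the pretrained weight W0,
    normalize each column to unit norm (direction), then rescale by
    the trainable magnitude vector m."""
    V = W0 + B @ A                                   # direction + low-rank delta
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * V / col_norm                          # magnitude * unit direction

d_out, d_in, r = 6, 4, 2
rng = np.random.default_rng(1)
W0 = rng.standard_normal((d_out, d_in))              # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01            # trainable low-rank factor
B = np.zeros((d_out, r))                             # B starts at zero, so BA = 0
m = np.linalg.norm(W0, axis=0, keepdims=True)        # magnitude init from W0

W = dora_weight(W0, A, B, m)
print(np.allclose(W, W0))  # True: with B = 0, the adapted weight equals W0
```

Because B is initialized to zero and m to the column norms of W0, the adapted weight exactly reproduces the pretrained one at the start of fine-tuning; training then adjusts A, B, and m while W0 stays frozen, keeping the number of trainable parameters small.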