AI Summary
This work addresses the problem of natural language-guided dual-arm robotic cloth folding, tackling challenges including garment self-occlusion, inaccurate cloth dynamics modeling, and poor generalization across diverse fabric types and physical environments. We propose the first language-conditioned dual-arm cloth folding framework comprising: (1) an end-to-end language-to-action mapping leveraging a pre-trained vision-language model; (2) a simulation-based action parsing and text-alignment method to mitigate the scarcity of real-world human annotations; and (3) a unified policy integrating dual-arm coordinated motion planning with cloth dynamics-aware control. Our approach achieves state-of-the-art performance on a language-guided cloth folding benchmark, attains superior accuracy on our newly constructed multi-fabric dataset, and demonstrates strong zero-shot generalization to unseen garment categories, novel instructions, and previously unencountered physical environments.
Abstract
Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. Given the lack of annotated bimanual folding data, we devise a procedure to automatically parse actions of a simulated dataset and tag them with aligned text instructions. BiFold attains the best performance on our dataset and can transfer to new instructions, garments, and environments.