đ¤ AI Summary
This work addresses the challenges in cloth foldingânamely, high deformation, self-occlusion, and the absence of explicit state representationsâby proposing a language-conditioned, end-to-end visionâaction mapping framework. To implicitly model cloth dynamics, we innovatively incorporate temporal context into a vision-language model, significantly improving handling of initial wrinkle configurations and recovery from execution failures. Our method extends the BiFold architecture, integrating multi-frame visual inputs, language instruction embeddings, and differentiable action prediction. It employs temporal feature fusion and cross-modal alignment fine-tuning to achieve joint spatiotemporal representation learning. In real-world cloth folding experiments, our approach achieves substantially higher success rates and robustness compared to baseline methods. Representation analysis further validates the effectiveness of textâimage region alignment and temporal consistency mechanisms.
đ Abstract
Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.