Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding

📅 2025-05-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges in cloth folding—namely, high deformation, self-occlusion, and the absence of explicit state representations—by proposing a language-conditioned, end-to-end vision–action mapping framework. To implicitly model cloth dynamics, we innovatively incorporate temporal context into a vision-language model, significantly improving handling of initial wrinkle configurations and recovery from execution failures. Our method extends the BiFold architecture, integrating multi-frame visual inputs, language instruction embeddings, and differentiable action prediction. It employs temporal feature fusion and cross-modal alignment fine-tuning to achieve joint spatiotemporal representation learning. In real-world cloth folding experiments, our approach achieves substantially higher success rates and robustness compared to baseline methods. Representation analysis further validates the effectiveness of text–image region alignment and temporal consistency mechanisms.

Technology Category

Application Category

📝 Abstract
Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.
Problem

Research questions and friction points this paper is trying to address.

Predict cloth folding actions from visual and language inputs
Handle complex garment dynamics and infinite configurations
Improve state estimation using temporal context for failed manipulations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal context for state estimation
Uses end-to-end learning for garment state encoding
Aligns text and image regions via fine-tuning
🔎 Similar Papers
O
Oriol Barbany
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
A
Adria ColomĂŠ
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Carme Torras
Carme Torras
Institut de Robòtica i Informàtica Industrial (CSIC-UPC)
Robotics and Artificial IntelligenceRobot LearningRobot VisionConstraint SatisfactionRobot Kinematics