Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges in cloth folding—namely, high deformation, self-occlusion, and the absence of explicit state representations—by proposing a language-conditioned, end-to-end vision–action mapping framework. To implicitly model cloth dynamics, we innovatively incorporate temporal context into a vision-language model, significantly improving handling of initial wrinkle configurations and recovery from execution failures. Our method extends the BiFold architecture, integrating multi-frame visual inputs, language instruction embeddings, and differentiable action prediction. It employs temporal feature fusion and cross-modal alignment fine-tuning to achieve joint spatiotemporal representation learning. In real-world cloth folding experiments, our approach achieves substantially higher success rates and robustness compared to baseline methods. Representation analysis further validates the effectiveness of text–image region alignment and temporal consistency mechanisms.

Technology Category

Application Category

📝 Abstract

Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.

Problem

Research questions and friction points this paper is trying to address.

Predict cloth folding actions from visual and language inputs

Handle complex garment dynamics and infinite configurations

Improve state estimation using temporal context for failed manipulations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal context for state estimation

Uses end-to-end learning for garment state encoding

Aligns text and image regions via fine-tuning

🔎 Similar Papers

General-purpose Clothes Manipulation with Semantic Keypoints

2024-08-15arXiv.orgCitations: 1

Authors to Follow