🤖 AI Summary
This work addresses two limitations of existing vision-language-action (VLA) models in folding deformable objects such as garments: poor generalization across categories, materials, and scenes, and task interference under multi-task training. The authors propose DeMaVLA, a unified VLA foundation model that is pretrained on large-scale real-world bimanual manipulation data to acquire general manipulation priors and then post-trained with human-in-the-loop DAgger. The resulting category-agnostic policy uses flow matching to generate continuous actions and a lightweight action expert built by pruning every other Transformer layer. This design preserves layer-wise alignment with the vision-language model while reducing training and inference cost. The model achieves competitive performance on the RoboTwin simulation benchmark and strong results on a real-world household garment-folding benchmark.
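To make the flow-matching step concrete, the following is a minimal sketch of how a flow-matching policy can turn noise into a continuous action by Euler-integrating a velocity field. It is not the authors' implementation: the function names, the step count, the convention that t = 0 is noise and t = 1 is the action, and the closed-form toy velocity field (standing in for the learned action expert) are all assumptions made for illustration.

```python
import random


def sample_action(velocity_fn, action_dim, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps.

    velocity_fn is a hypothetical stand-in for the learned action expert.
    """
    rng = random.Random(seed)
    # Start from a standard Gaussian noise sample.
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy velocity field (assumption): points along the straight line to `target`.

    For the linear path x_t = (1 - t) * x_0 + t * x_1, the velocity that
    transports every x_t to x_1 is (x_1 - x_t) / (1 - t).
    """
    def velocity_fn(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity_fn


target = [0.5, -0.2, 0.1]  # hypothetical 3-D action
action = sample_action(make_toy_velocity(target), action_dim=len(target))
```

In a real policy the velocity field would be a network conditioned on images, language, and robot state, and the output would typically be a chunk of future actions rather than a single vector.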
📝 Abstract
Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
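The "pruning every other transformer layer" step can be illustrated with a small index-mapping sketch. The abstract does not say which layers are retained or how the expert is wired to the backbone, so the choice of even-indexed layers, the helper names, and the idea of recording an explicit expert-to-backbone layer map are assumptions for illustration only.

```python
def pruned_layer_map(num_vlm_layers, stride=2, offset=0):
    """Return the backbone layer index paired with each action-expert layer.

    With stride=2 and offset=0 (an assumption), expert layer j is aligned
    with backbone layer 2 * j, so the expert has about half the depth.
    """
    return list(range(offset, num_vlm_layers, stride))


def prune_layers(layers, stride=2, offset=0):
    """Keep only the layers selected by pruned_layer_map, preserving order."""
    return [layers[i] for i in pruned_layer_map(len(layers), stride, offset)]


# Hypothetical 8-layer backbone, with layers represented by their names.
vlm_layers = [f"vlm_block_{i}" for i in range(8)]
expert_layers = prune_layers(vlm_layers)
alignment = pruned_layer_map(len(vlm_layers))
```

Here `alignment[j]` gives the backbone layer that expert layer `j` corresponds to, which is one simple way to keep a layer-wise correspondence after halving the depth.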