Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach

πŸ“… 2025-09-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Vision-Language-Action (VLA) models exhibit insufficient generalization in complex, real-world robotic tasks. Method: We propose a human-in-the-loop dual-actor reinforcement fine-tuning framework: a primary actor executes the global policy, while a refinement actor performs fine-grained action correction; a lightweight "talk-and-tweak" scheme automatically converts real-time human corrections into semantically grounded language commands, building a high-quality dataset for policy learning. The approach integrates latent-space adaptation, semantic alignment optimization, and multi-robot collaborative training. Contributions/Results: On three real-world robotic tasks, the method reaches a 100% success rate within 101 minutes of online fine-tuning; sustains 50% success on a long-horizon, 12-step sequential manipulation task; and achieves up to a 2x training-efficiency gain via dual-robot parallelization. This work enhances the online adaptability and scalability of VLA models in open-world environments.

πŸ“ Abstract
Vision-language-action (VLA) models demonstrate strong generalization in robotic manipulation but face challenges in complex, real-world tasks. While supervised fine-tuning with demonstrations is constrained by data quality, reinforcement learning (RL) offers a promising alternative. We propose a human-in-the-loop dual-actor fine-tuning framework grounded in RL. The framework integrates a primary actor for robust multi-task performance with a refinement actor for latent-space adaptation. Beyond standard physical interventions, we introduce a lightweight talk-and-tweak scheme that converts human corrections into semantically grounded language commands, thereby generating a new dataset for policy learning. In real-world multi-task experiments, our approach achieves 100% success across three tasks within 101 minutes of online fine-tuning. For long-horizon tasks, it sustains a 50% success rate over 12 consecutive operations. Furthermore, the framework scales effectively to multi-robot training, achieving up to a 2 times improvement in efficiency when using dual robots. The experiment videos are available at https://sites.google.com/view/hil-daft/.
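As a conceptual sketch of the loop the abstract describes (a primary actor proposing actions, a refinement actor correcting them, and talk-and-tweak turning human interventions into labeled training data), the control flow might look like the following. All class and function names here are hypothetical stand-ins, not the paper's implementation; the real system builds on a pretrained VLA backbone and real robot observations.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transition:
    obs: List[float]
    action: List[float]
    instruction: str   # talk-and-tweak language label ("" if no correction)
    corrected: bool    # True when a human intervened


class PrimaryActor:
    """Stand-in for the global VLA policy (robust multi-task behavior)."""
    def act(self, obs: List[float]) -> List[float]:
        return [0.0 for _ in obs]  # placeholder action


class RefinementActor:
    """Stand-in for fine-grained, latent-space action correction."""
    def refine(self, obs: List[float], action: List[float]) -> List[float]:
        return [a + 0.01 for a in action]  # small residual adjustment


def human_feedback(obs, action) -> Optional[Tuple[List[float], str]]:
    """Placeholder: occasionally return a tweaked action plus a spoken
    instruction, mimicking a talk-and-tweak intervention."""
    if random.random() < 0.2:
        return [a + 0.1 for a in action], "move slightly left"
    return None


def collect_episode(env_steps: int = 10) -> List[Transition]:
    """Roll out the dual-actor policy; log corrected and uncorrected
    transitions into a dataset for subsequent RL fine-tuning."""
    primary, refiner = PrimaryActor(), RefinementActor()
    dataset: List[Transition] = []
    obs = [0.0, 0.0, 0.0]
    for _ in range(env_steps):
        action = refiner.refine(obs, primary.act(obs))
        fb = human_feedback(obs, action)
        if fb is not None:
            action, instruction = fb
            dataset.append(Transition(obs, action, instruction, True))
        else:
            dataset.append(Transition(obs, action, "", False))
        obs = [o + a for o, a in zip(obs, action)]  # toy dynamics
    return dataset
```

The key design point this sketch illustrates is that human corrections are not discarded after execution: each one is stored together with a semantic language label, so the fine-tuning dataset carries both the corrected action and the reason for it.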
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLA model performance in complex robotic manipulation tasks
Overcoming the data-quality limits of supervised fine-tuning by using RL
Enabling human-guided policy adaptation through language command corrections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop dual-actor framework
Talk-and-tweak language command conversion
Latent-space adaptation for policy learning
πŸ”Ž Similar Papers
No similar papers found.
Piaopiao Jin
Beijing Xiaomi Robot Technology Co., Ltd.
Qi Wang
Beijing Xiaomi Robot Technology Co., Ltd.
Guokang Sun
Beijing Xiaomi Robot Technology Co., Ltd.
Ziwen Cai
City University of Hong Kong
Pinjia He
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Software Engineering, AI4SE, SE4AI, AIOps
Yangwei You
Xiaomi Robotics Lab
Legged Locomotion, Robotics