Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach

2025-09-17

AI Summary
Problem: Vision-Language-Action (VLA) models generalize poorly to complex, real-world robotic tasks. Method: The paper proposes a human-in-the-loop, dual-actor reinforcement fine-tuning framework in which a primary actor executes the global policy while a refinement actor performs fine-grained action correction. A lightweight "talk-and-tweak" mechanism converts real-time human corrections into semantically grounded language commands, which are collected into a new dataset for policy learning. The approach integrates latent-space adaptation, semantic alignment optimization, and multi-robot collaborative training. Contributions/Results: Across three real-world robotic tasks, the method reaches a 100% success rate within 101 minutes of online fine-tuning; it sustains a 50% success rate on a long-horizon task of 12 consecutive operations; and it achieves up to a 2x efficiency improvement when two robots train in parallel. These results indicate improved online adaptability and scalability of VLA models in real-world settings.

Abstract
Vision-language-action (VLA) models demonstrate strong generalization in robotic manipulation but face challenges in complex, real-world tasks. While supervised fine-tuning with demonstrations is constrained by data quality, reinforcement learning (RL) offers a promising alternative. We propose a human-in-the-loop dual-actor fine-tuning framework grounded in RL. The framework integrates a primary actor for robust multi-task performance with a refinement actor for latent-space adaptation. Beyond standard physical interventions, we introduce a lightweight talk-and-tweak scheme that converts human corrections into semantically grounded language commands, thereby generating a new dataset for policy learning. In real-world multi-task experiments, our approach achieves 100% success across three tasks within 101 minutes of online fine-tuning. For long-horizon tasks, it sustains a 50% success rate over 12 consecutive operations. Furthermore, the framework scales effectively to multi-robot training, achieving up to a 2 times improvement in efficiency when using dual robots. The experiment videos are available at https://sites.google.com/view/hil-daft/.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLA model performance in complex robotic manipulation tasks
Overcoming data quality limitations in supervised fine-tuning with RL
Enabling human-guided policy adaptation through language command corrections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop dual-actor framework
Talk-and-tweak language command conversion
Latent-space adaptation for policy learning
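
The talk-and-tweak idea (turning a human correction into a language command and storing it as training data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the axis-to-phrase table, the `deadband` threshold, the `Sample` record, and the function names `correction_to_command` and `record_correction` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical phrase table: (phrase for negative motion, phrase for positive motion).
# The vocabulary actually used in the paper is not stated in this summary.
AXIS_PHRASES = {
    0: ("move left", "move right"),        # x axis
    1: ("move backward", "move forward"),  # y axis
    2: ("move down", "move up"),           # z axis
}


@dataclass
class Sample:
    """One hypothetical training record produced by a human correction."""
    observation: object
    command: str
    correction: tuple


def correction_to_command(delta, deadband=0.01):
    """Map a Cartesian correction (dx, dy, dz) to a short language command.

    Returns None when the largest component is below the deadband,
    i.e. when the correction is too small to be meaningful.
    """
    axis = max(range(3), key=lambda i: abs(delta[i]))
    if abs(delta[axis]) < deadband:
        return None
    negative, positive = AXIS_PHRASES[axis]
    return positive if delta[axis] > 0 else negative


def record_correction(dataset, observation, delta):
    """Append a (observation, command, correction) record if a command applies."""
    command = correction_to_command(delta)
    if command is not None:
        dataset.append(Sample(observation, command, tuple(delta)))
    return command


if __name__ == "__main__":
    data = []
    print(record_correction(data, "frame_0", (0.0, -0.05, 0.02)))
    print(len(data))
```

In this sketch the dominant axis of the correction selects the phrase, and small corrections are ignored so that sensor noise does not generate spurious commands. A real system would likely use a richer vocabulary and ground the command in the scene.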