Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

📅 2026-06-10
🤖 AI Summary
Many existing human-robot collaboration systems rely on hand-engineered pipelines, which limits their scalability and flexibility for new tasks. This work shows that vision-language-action (VLA) models trained end-to-end with imitation learning can support collaborative manipulation, and it identifies a failure mode of action-chunking policies in implicit collaboration: demonstration action leakage, where action chunks cross latent task transitions and cause premature assistance, such as a robot attempting a tool handover before the person is ready. The issue grows with longer execution horizons. To mitigate it, the authors propose an inference-time steering method that suppresses erroneous assistive actions while preserving policy performance. A 16-participant user study on a long-horizon collaborative assembly task shows that steering enables a longer execution horizon, leading to faster collaboration and fewer failures than a shorter-horizon policy.
📝 Abstract
Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.
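The abstract describes action leakage only informally, as action chunks that cross latent task transitions. The sketch below is one possible reading of that idea, not the paper's procedure: it slides a fixed-length window over a single demonstration and counts how many windows span more than one phase. The function name `leaking_chunk_fraction`, the phase labels, and the 60/40 step split are all hypothetical and chosen purely for illustration.

```python
def leaking_chunk_fraction(phase_labels, chunk_len):
    """Fraction of length-`chunk_len` windows that span more than one phase.

    `phase_labels` is a per-timestep list of (hypothetical) phase names for
    one demonstration; a window "leaks" if it contains two or more distinct
    labels, i.e. it crosses a task transition.
    """
    n_windows = len(phase_labels) - chunk_len + 1
    if n_windows <= 0:
        raise ValueError("chunk_len exceeds demonstration length")
    leaking = sum(
        1
        for start in range(n_windows)
        if len(set(phase_labels[start:start + chunk_len])) > 1
    )
    return leaking / n_windows


# Hypothetical demonstration: 60 steps where the robot should wait for the
# person, followed by 40 steps of handover motion.
demo = ["wait"] * 60 + ["handover"] * 40

for chunk_len in (1, 10, 25, 50):
    fraction = leaking_chunk_fraction(demo, chunk_len)
    print(f"chunk_len={chunk_len:>2}  leaking fraction={fraction:.3f}")
```

Under these assumptions the leaking fraction is zero for single-step chunks and rises as the chunk gets longer, which is consistent with the abstract's observation that the issue increases with longer execution horizons.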
Problem

Research questions and friction points this paper is trying to address.

human-robot collaboration
vision-language-action models
action chunking
premature assistance
imitation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language-action (VLA)
implicit human-robot collaboration
action chunking
inference-time steering
imitation learning