🤖 AI Summary
Existing vision-language-action (VLA) models operate at a fixed execution speed, making it difficult to simultaneously achieve efficiency during low-risk phases and fine-grained control during high-risk phases. This work proposes TempoVLA, the first VLA framework that enables continuous, bidirectional control over execution speed within a single model. TempoVLA explicitly modulates action outputs through a speed-conditioned policy network and introduces Velocity-aware Semantic Trajectory Augmentation (VSTA), a semantics-preserving method for variable-speed trajectory augmentation that enhances data efficiency. Furthermore, it integrates large-model-assisted reasoning to enable risk-aware dynamic speed adjustment. Experiments demonstrate that TempoVLA achieves precise speed control with negligible motion error across both simulated and real-world tasks, while also significantly improving task performance even at its default execution speed.
📝 Abstract
Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.