Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the computational inefficiency of existing diffusion-based vision-language-action (VLA) models, which generate actions through multi-step iterative denoising. The authors propose a one-step action generation recipe that keeps standard velocity-prediction diffusion training and simply biases the training time distribution toward high-noise states, with no teacher model, distillation stage, or auxiliary objective. The approach exploits the asymmetry between rich conditioning inputs (observations, language, and state) and the compact, low-dimensional action chunk being predicted. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO they can exceed ten-step policies trained with a uniform time distribution. With a 1.4B vision-language model and a 30M action head, one-step decoding reaches a 95.6% success rate on LIBERO-Long, and a small-sample real-robot bimanual evaluation shows the same sampler trend.

📝 Abstract

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
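
The recipe described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the convention that t = 1 is pure noise and t = 0 is the clean action chunk, the linear interpolation path, the Beta(4, 1) distribution used as the "high-noise biased" schedule, and the hypothetical names `sample_times`, `velocity_training_pair`, and `one_step_decode`. The abstract does not state which biased distribution the paper uses.

```python
import numpy as np


def sample_times(rng, batch_size, high_noise_bias=True):
    """Sample training times t in [0, 1]; t = 1 is pure noise (assumed convention)."""
    if high_noise_bias:
        # Beta(4, 1) has density 4 * t**3, so most samples land near t = 1.
        # The choice of Beta(4, 1) is an illustrative assumption.
        return rng.beta(4.0, 1.0, size=batch_size)
    # Baseline: uniform time distribution.
    return rng.uniform(0.0, 1.0, size=batch_size)


def velocity_training_pair(rng, actions, t):
    """Build the noisy input x_t and the velocity regression target."""
    noise = rng.standard_normal(actions.shape)
    # Broadcast t over all non-batch dimensions of the action chunk.
    t = t.reshape(-1, *([1] * (actions.ndim - 1)))
    x_t = (1.0 - t) * actions + t * noise  # linear path from action to noise
    target = noise - actions               # d x_t / d t along that path
    return x_t, target


def one_step_decode(velocity_fn, cond, noise):
    """Take a single Euler step from t = 1 (noise) to t = 0 (action)."""
    t = np.ones(noise.shape[0])
    return noise - velocity_fn(noise, t, cond)
```

In this sketch, a training loop would regress a network's output on `target` given `x_t`, `t`, and the conditioning inputs; only `sample_times` differs between the uniform and the biased recipe. At inference, `one_step_decode` replaces a ten-step integration loop with one network call.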

Problem

Research questions and friction points this paper is trying to address.

vision-language-action

diffusion models

action generation

one-step generation

robot policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

one-step action generation

vision-language-action models

diffusion policy