🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models: they treat the action space as unstructured, which neglects the geometric proximity among actions and hinders policy learning efficiency and generalization. To overcome this, the authors propose ActionMap, a voxel-based action heatmap decoder that introduces structured voxel representations into VLA frameworks for the first time and explicitly models the geometric structure of the action space. By predicting a probability distribution over voxels in the action space, ActionMap generates continuous control signals and replaces the native action decoder of a standard VLA architecture. Experiments on the LIBERO simulation benchmark and a real-world Franka robot platform show that ActionMap outperforms current methods (e.g., an 8.2% improvement over OpenVLA-OFT on the LIBERO four-suite average), with comparable or faster convergence and better data efficiency and generalization, particularly in low-data regimes.
📝 Abstract
Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor in the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To address this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head outperforms the native action decoders of two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient when training data is scarce. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://github.com/showlab/ActionMap.
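To make the idea of a voxel heatmap over the action space concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation, and several details are assumptions the abstract does not specify: a 3-D action subspace, a uniform grid, a soft-argmax (probability-weighted mean of voxel centers) to recover a continuous action, and a Gaussian-smoothed training target as one possible way to reward predictions near the ground-truth action. All function names (`voxel_centers`, `decode_action`, `gaussian_target`) are hypothetical.

```python
import math
from itertools import product


def voxel_centers(low, high, bins):
    """Centers of a uniform grid with `bins` voxels per axis (assumed layout)."""
    axes = []
    for lo, hi in zip(low, high):
        width = (hi - lo) / bins
        axes.append([lo + (i + 0.5) * width for i in range(bins)])
    # One center tuple per voxel, in row-major order.
    return list(product(*axes))


def softmax(logits):
    """Numerically stable softmax over a flat list of voxel logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_action(logits, centers):
    """Assumed decoding rule: soft-argmax, i.e. the heatmap's expected action."""
    probs = softmax(logits)
    dims = len(centers[0])
    return tuple(
        sum(p * c[d] for p, c in zip(probs, centers)) for d in range(dims)
    )


def gaussian_target(action, centers, sigma):
    """Assumed training target: normalized Gaussian around the true action.

    Voxels near the ground-truth action receive more probability mass than
    distant ones, so near-miss predictions are penalized less.
    """
    weights = []
    for c in centers:
        sq_dist = sum((a - x) ** 2 for a, x in zip(action, c))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    centers = voxel_centers((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0), bins=4)
    target = gaussian_target((0.3, -0.2, 0.7), centers, sigma=0.25)
    # Treat log-probabilities of the target as stand-in "predicted" logits.
    logits = [math.log(t + 1e-12) for t in target]
    print(decode_action(logits, centers))
```

In this sketch the head's output is a full distribution rather than a single point, which is the contrast with L1 regression that the abstract draws. A cross-entropy loss between the predicted heatmap and a target such as `gaussian_target` would be one plausible training objective, but the abstract does not state the loss that ActionMap uses.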