🤖 AI Summary
This work addresses the imprecise perception-to-action mapping in Vision-Language-Action models, which stems from a structural mismatch between the semantic spaces of vision-language models and embodied control policies. To bridge this gap, the authors propose AffordanceVLA, a unified framework that uses structured affordance forecasting as an intermediate representation linking perception and action. The framework comprises three components: Which2Act for object-centric grounding, Where2Act for 2D affordance map estimation, and How2Act for 3D geometric reasoning. These are integrated into a Mixture-of-Transformers architecture with specialized experts and trained with a three-stage strategy and a progressive data curriculum, while an automated data augmentation pipeline offsets the scarcity of dense affordance labels. Experiments in simulation and in the real world show strong performance across diverse manipulation scenarios.
📝 Abstract
Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception--action mappings. To address this challenge, we propose \textbf{AffordanceVLA}, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception--action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) \textbf{Which2Act} for object-centric grounding via visual latent prediction to suppress distractions; 2) \textbf{Where2Act} for 2D interaction localization via affordance map estimation; and 3) \textbf{How2Act} for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformers (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
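As a rough illustration of how a 2D affordance map (the kind of output attributed to Where2Act) could feed 3D geometric reasoning (the role attributed to How2Act), the sketch below picks the highest-scoring pixel of a heatmap and back-projects it into a 3D camera-frame point. This is not the authors' implementation: the function names `affordance_peak` and `backproject`, the use of a depth image, the pinhole camera model, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions made for this example.

```python
import numpy as np


def affordance_peak(heatmap):
    """Return (row, col) of the highest-scoring pixel in a 2D affordance map."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(row), int(col)


def backproject(row, col, depth, fx, fy, cx, cy):
    """Lift a pixel to a 3D camera-frame point using the pinhole model.

    depth is assumed to hold metric depth per pixel; fx, fy, cx, cy are
    assumed camera intrinsics (focal lengths and principal point, in pixels).
    """
    z = float(depth[row, col])
    x = (col - cx) * z / fx
    y = (row - cy) * z / fy
    return np.array([x, y, z])


# Toy inputs (hypothetical): a 4x6 heatmap with one hot pixel, flat 0.5 m depth.
heatmap = np.zeros((4, 6))
heatmap[2, 5] = 1.0
depth = np.full((4, 6), 0.5)

row, col = affordance_peak(heatmap)
point = backproject(row, col, depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
print((row, col), point)
```

In this toy setup the 3D point could serve as a candidate interaction location for a downstream policy; how AffordanceVLA actually couples its affordance cues to the policy is not specified in the abstract.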