🤖 AI Summary
Existing vision-language-action (VLA) models struggle to satisfy quantitative task constraints—such as precisely halting a grasp upon reaching a target weight—because their action mappings rely on implicit data-driven modeling without explicit condition monitoring. This work introduces CLAW, the first weight-aware robotic grasping framework. CLAW decouples conditional reasoning from action generation: it processes real-time weight feedback via a fine-tuned CLIP model to produce discrete symbolic commands, which then modulate a flow-based VLA policy π₀ for continuous visuomotor control. By integrating symbolic weight reasoning with end-to-end action generation, CLAW achieves robust target-weight accuracy in both single-object grasping and bimanual cooperative manipulation of heterogeneous objects. It significantly outperforms both the base policy π₀ and its fine-tuned variants, demonstrating the efficacy of explicit, symbol-mediated constraint enforcement in VLA systems.
📝 Abstract
Vision-language-action (VLA) models have recently emerged as a promising paradigm for robotic control, enabling end-to-end policies that ground natural language instructions into visuomotor actions. However, current VLAs often struggle to satisfy precise task constraints, such as stopping based on numeric thresholds, because their observation-to-action mappings are implicitly shaped by training data and lack explicit mechanisms for condition monitoring. In this work, we propose CLAW (CLIP-Language-Action for Weight), a framework that decouples condition evaluation from action generation. CLAW leverages a fine-tuned CLIP model as a lightweight prompt generator, which continuously monitors the digital readout of a scale and produces discrete directives based on task-specific weight thresholds. These prompts are then consumed by $π_0$, a flow-based VLA policy, which integrates them with multi-view camera observations to produce continuous robot actions. This design enables CLAW to combine symbolic weight reasoning with high-frequency visuomotor control. We validate CLAW on three experimental setups, covering single-object grasping and mixed-object tasks that require dual-arm manipulation. Across all conditions, CLAW reliably executes weight-aware behaviors and outperforms both the raw $π_0$ model and fine-tuned $π_0$ variants. Videos are provided as supplementary materials.
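The decoupled design described above can be sketched as a simple control loop: a symbolic condition checker maps the scale readout to a discrete directive, and the directive is handed to the action policy alongside visual observations. The sketch below is illustrative only; the function names, threshold logic, and toy policy are assumptions, standing in for the fine-tuned CLIP prompt generator and the $π_0$ policy from the paper.

```python
def weight_directive(measured_g: float, target_g: float, tol_g: float = 5.0) -> str:
    """Stand-in for the CLIP-based prompt generator: map a scale readout
    to a discrete symbolic command using a task-specific weight threshold.
    (The real system reads the digital scale display with a fine-tuned CLIP.)"""
    if measured_g >= target_g - tol_g:
        return "stop grasping"      # target weight reached
    return "continue grasping"      # below threshold, keep acting

def control_step(policy, observation: dict, measured_g: float, target_g: float) -> dict:
    """One control cycle: symbolic condition evaluation, then continuous
    action generation, mirroring CLAW's two-stage decoupling."""
    prompt = weight_directive(measured_g, target_g)
    # The VLA policy (pi_0 in the paper) consumes the directive together
    # with multi-view camera observations to produce a continuous action.
    return policy({**observation, "instruction": prompt})

def toy_policy(obs: dict) -> dict:
    """Trivial stand-in for pi_0: freezes the gripper once told to stop."""
    moving = obs["instruction"] == "continue grasping"
    return {"gripper_velocity": 1.0 if moving else 0.0}
```

For example, once the measured weight crosses the target threshold, `control_step` yields a zero gripper velocity; below it, grasping continues. The key design point is that the numeric comparison lives outside the policy, so the VLA never has to learn threshold logic implicitly from data.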