See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of vision-language-action (VLA) models under distractors, appearance shifts, and semantically similar tasks, where the policy must infer fine-grained execution details from coarse instructions. The authors propose S2 ("See Less, Specify More"), a framework that keeps the original high-level instruction as a stable goal while relabeling each trajectory with refined trajectory- and subtask-level language guidance. An explicit visual evidence budget constrains the policy to act from task-sufficient visual evidence rather than unconstrained visual context. The approach requires no region or mask annotations and remains compatible with off-the-shelf vision-language model planners through in-context learning. Across eight real-robot tasks on the TX-G2 and HSR platforms, the method raises mean subtask success from 54.2% (the pi0.5 baseline) to 79.0%, a gain of 24.8 percentage points.

📝 Abstract

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.
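To make the two halves of the interface concrete, the sketch below is illustrative only and does not reproduce the authors' implementation. The abstract does not say how evidence is scored or how the budget is enforced, so the top-k patch selection here is a stand-in technique chosen for simplicity, and every name (`RelabeledStep`, `select_patches`, `budget`, the per-patch `scores`, the example instructions) is a hypothetical assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RelabeledStep:
    """Hypothetical container for "Specify More": the goal is kept, finer labels are added."""
    goal: str              # original high-level instruction, left unchanged
    trajectory_label: str  # refined description of the whole trajectory
    subtask_label: str     # description of the current subtask

    def prompt(self) -> str:
        # Concatenate all three levels so the executor sees goal and local guidance together.
        return (f"Goal: {self.goal} | Trajectory: {self.trajectory_label} "
                f"| Subtask: {self.subtask_label}")


def select_patches(scores: Sequence[float], budget: float) -> List[int]:
    """Hypothetical "See Less" step: keep the top-scoring fraction of image patches.

    `scores` are assumed per-patch relevance values (their source is not specified
    in the abstract); `budget` is the fraction of patches the policy may use.
    Returns the kept patch indices in their original order.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    keep = max(1, int(len(scores) * budget))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


step = RelabeledStep(
    goal="put the cup on the shelf",
    trajectory_label="pick up the red cup and place it on the top shelf",
    subtask_label="grasp the red cup by its handle",
)
kept = select_patches([0.1, 0.9, 0.3, 0.8], budget=0.5)
```

In this toy setup a budget of 0.5 over four patches keeps the two highest-scoring ones, and the prompt carries the unchanged goal alongside the finer labels, mirroring the goal-preserving guidance the abstract contrasts with instruction replacement.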

Problem

Research questions and friction points this paper is trying to address.

generalization

vision-language-action

visual evidence

instruction ambiguity

task execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual evidence budget

trajectory-level language relabeling

VLA generalization