🤖 AI Summary
Existing vision-language model (VLM) agents exhibit limited adaptability in dynamic environments such as web navigation, and fine-tuning them is costly. This work proposes a training-free, inference-time optimization approach that decouples action generation from selection: the VLM is kept frozen and used to generate candidate actions, while a lightweight, offline-trained Q-function reranks these candidates. Notably, this is the first method to employ a Q-function directly for action selection at inference time without updating the underlying policy, enabling immediate performance gains. Evaluated on the WebVoyager benchmark, the approach boosts the success rate of Qwen2.5-VL-7B from 38.8% to 55.7% and of GPT-4.1 from 82.4% to 88.8%, significantly outperforming baseline methods.
📝 Abstract
Vision-Language Models (VLMs) have become powerful backbones for agents that autonomously operate in digital environments such as the web and operating systems. However, these models adapt poorly to fast-changing environments like the web; fine-tuning can alleviate this, but it requires expensive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference time, without policy retraining. Fundamentally, our approach decouples the VLM's role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. A lightweight, offline-trained Q-function then reranks these candidates, and the agent executes the action with the highest estimated value. Our main contribution is to apply the Q-function directly during inference for immediate policy improvement, rather than offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
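The propose-then-rerank loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Candidate`, `rerank`, and the toy Q-function are hypothetical names standing in for the frozen VLM's sampled actions and the offline-trained value model.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    """One action proposed by the frozen VLM policy, e.g. 'click(#search-button)'."""
    action: str

def rerank(candidates: List[Candidate],
           q_fn: Callable[[str, str], float],
           state: str) -> Candidate:
    """Select the candidate with the highest estimated Q(state, action).

    The VLM policy is never updated; only the selection step changes.
    """
    return max(candidates, key=lambda c: q_fn(state, c.action))

# Toy stand-in for the lightweight offline-trained Q-function.
def toy_q(state: str, action: str) -> float:
    return 1.0 if "search" in action else 0.0

candidates = [Candidate("scroll(down)"), Candidate("click(#search-button)")]
best = rerank(candidates, toy_q, state="homepage")
print(best.action)  # → click(#search-button)
```

In the paper's setting, the candidates come from sampling the frozen VLM several times on the current (screenshot, instruction) state, and the Q-function is a small model trained offline on logged trajectories; the loop above only changes which proposed action is executed.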