🤖 AI Summary
Existing vision-language models (VLMs) often fail to produce correct actions in complex GUI interactions. State-of-the-art commercial models are black boxes, fine-tuning open-source models is costly, and trajectory-level evaluation and refinement suffer from delayed reward feedback and local optimization. To address these challenges, this paper proposes a *process-reward-guided inference-time framework*. Its core idea is to integrate a fine-grained, action-level process reward model directly into the VLM's inference pipeline, enabling dynamic, stepwise policy guidance. The framework further incorporates GUI state awareness, trajectory-level reflection, and iterative retry mechanisms. Across three GUI navigation tasks, it improves single-step action accuracy by 3.4% in static environments and raises task success rate by around 33% in one dynamic environment. Reflection and retry yield additional gains in task success. Because the guidance is applied at inference time, the method requires no model fine-tuning and gives feedback at every step rather than only after a full trajectory.
📝 Abstract
Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy for static environments, along with an increase of around 33% in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we demonstrate even greater gains in task success.
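The per-step guidance described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the reward model's scores are turned into an action choice, so the best-of-k selection below is an assumption, not the authors' method. The names `guided_step`, `toy_propose`, and `toy_score`, the string representation of states and actions, and the reward values are all hypothetical stand-ins for a real VLM and a real process reward model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Propose(state, goal, k) -> up to k candidate actions from a VLM agent
#   Score(state, goal, action) -> process reward for a single candidate action
Propose = Callable[[str, str, int], Sequence[str]]
Score = Callable[[str, str, str], float]


def guided_step(
    state: str, goal: str, propose: Propose, score: Score, k: int = 4
) -> Tuple[str, float]:
    """Return the candidate action with the highest process reward."""
    candidates = list(propose(state, goal, k))
    if not candidates:
        raise ValueError("the proposer returned no candidate actions")
    # Score every candidate at this step, so feedback arrives before acting
    # instead of only after the whole trajectory has finished.
    scored = [(score(state, goal, action), action) for action in candidates]
    best_reward, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_reward


# Toy stand-ins so the sketch runs without any model.
def toy_propose(state: str, goal: str, k: int) -> Sequence[str]:
    actions = ["scroll(down)", "click(search_box)", "click(back)", "type('shoes')"]
    return actions[:k]


def toy_score(state: str, goal: str, action: str) -> float:
    rewards = {"scroll(down)": 0.2, "click(search_box)": 0.9, "type('shoes')": 0.4}
    return rewards.get(action, 0.0)


action, reward = guided_step("home_screen", "search for shoes", toy_propose, toy_score)
print(action, reward)
```

In a real agent loop, `guided_step` would be called once per screen, with the chosen action executed before the next state is observed. Trajectory reflection and retry would wrap that loop and are left out of this sketch.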