A History-Aware Visually Grounded Critic for Computer Use Agents

📅 2026-06-09

🤖 AI Summary

Existing critic models for GUI agents struggle to correct errors in complex interactions because their decision loops are short-sighted and they lack sufficient visual grounding. This work proposes HiViG, a test-time framework that combines abstraction of the action history with visually grounded critique. HiViG compresses the interaction history into a compact record of macro-actions and checks each proposed action's raw execution coordinates against the current screenshot, intercepting errors before the action is executed. Ablations indicate that the macro-action history mitigates short-sighted planning and that the visually grounded critique reduces execution errors in long-horizon tasks. Across web, mobile, and desktop benchmarks, HiViG improves average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and shows strong cross-platform generalization.

📝 Abstract

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.
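
The test-time loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`Action`, `Element`, `Verdict`, `grounded_critique`, `step`) is hypothetical. The learned multimodal critic is replaced by a simple bounding-box lookup over a list of known elements, which stands in for the screenshot. The macro-action history is reduced to one text entry per accepted action rather than a learned abstraction. The sketch shows only the control flow: the policy proposes an action, the critic checks the coordinates before execution, and a rejected action is returned to the policy with feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str    # e.g. "click"
    x: int       # raw execution coordinates proposed by the policy
    y: int
    target: str  # the element the policy says it intends to act on


@dataclass
class Element:
    name: str
    box: Tuple[int, int, int, int]  # left, top, right, bottom


@dataclass
class Verdict:
    accept: bool
    feedback: str


def element_at(elements: List[Element], x: int, y: int) -> Optional[Element]:
    """Return the first element whose box contains (x, y), if any."""
    for el in elements:
        left, top, right, bottom = el.box
        if left <= x <= right and top <= y <= bottom:
            return el
    return None


def grounded_critique(action: Action, elements: List[Element]) -> Verdict:
    """Assumed stand-in for the critic: compare coordinates with the intended target."""
    hit = element_at(elements, action.x, action.y)
    if hit is None:
        return Verdict(False, f"({action.x}, {action.y}) hits no element")
    if hit.name != action.target:
        return Verdict(False, f"({action.x}, {action.y}) hits '{hit.name}', not '{action.target}'")
    return Verdict(True, "")


Policy = Callable[[List[str], str], Action]


def step(policy: Policy, elements: List[Element], history: List[str],
         max_retries: int = 2) -> Optional[Action]:
    """Propose, critique before execution, and retry with feedback on rejection."""
    feedback = ""
    for _ in range(max_retries + 1):
        action = policy(history, feedback)
        verdict = grounded_critique(action, elements)
        if verdict.accept:
            # Simplification: one history entry per accepted action.
            history.append(f"{action.kind} '{action.target}'")
            return action
        feedback = verdict.feedback
    return None  # every proposal was rejected; nothing is executed


# Toy usage: the first proposal lands on the wrong element and is intercepted.
elements = [Element("Search", (0, 0, 100, 40)), Element("Submit", (120, 0, 220, 40))]


def toy_policy(history: List[str], feedback: str) -> Action:
    if not feedback:
        return Action("click", 50, 20, "Submit")  # wrong coordinates
    return Action("click", 150, 20, "Submit")     # corrected after feedback


history: List[str] = []
chosen = step(toy_policy, elements, history)
print(history)  # ["click 'Submit'"]
```

In this toy run the first click would have landed on "Search", so the check rejects it and the policy's second proposal is accepted. In the framework as described, both the critique and the history abstraction come from a critic trained on real GUI trajectories, which this sketch does not attempt to model.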

Problem

Research questions and friction points this paper is trying to address.

Computer Use Agents

critic models

visual grounding

short-sighted decision

GUI environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

History-aware

Visually Grounded

Computer Use Agents