🤖 AI Summary
This work addresses the unreliability of existing methods for validating computer-use agent (CUA) trajectories, which distorts both evaluation and training signals. To overcome this limitation, the authors present the Universal Verifier, a general-purpose validator grounded in four core principles: rubrics with meaningful, non-overlapping criteria; separation of process- and outcome-based rewards; distinction between controllable and uncontrollable failures, scored via a cascading-error-free strategy; and divide-and-conquer context management that attends to every screenshot in a trajectory. The verifier is evaluated on CUAVerifierBench, a newly constructed benchmark of CUA trajectories with human labels for both process and outcome. Experimental results show that the Universal Verifier agrees with human experts as often as humans agree with each other, reduces false positive rates to near zero, and substantially outperforms baseline approaches such as WebVoyager and WebJudge.
📝 Abstract
Verifying the success of computer-use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signals can be trusted. In this paper, we present lessons learned from building a best-in-class verifier for web tasks, which we call the Universal Verifier. We design the Universal Verifier around four key principles: 1) constructing rubrics with meaningful, non-overlapping criteria to reduce noise; 2) separating process and outcome rewards, which yield complementary signals that capture cases where an agent follows the right steps but gets blocked, or succeeds through an unexpected path; 3) distinguishing between controllable and uncontrollable failures, scored via a cascading-error-free strategy for finer-grained failure understanding; and 4) employing a divide-and-conquer context-management scheme that attends to all screenshots in a trajectory, improving reliability on longer task horizons. We validate these findings on CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels, showing that our Universal Verifier agrees with humans as often as humans agree with each other. We report a reduction in false positive rates to near zero, compared to baselines like WebVoyager ($\geq$ 45\%) and WebJudge ($\geq$ 22\%). We emphasize that these gains stem from the cumulative effect of the design choices above. We also find that an auto-research agent achieves 70\% of expert quality in 5\% of the time, but fails to discover all strategies required to replicate the Universal Verifier. We open-source the Universal Verifier system along with CUAVerifierBench, available at https://github.com/microsoft/fara.
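A minimal sketch may help illustrate the second and third principles: keeping process and outcome rewards as separate signals, and excluding uncontrollable failures (e.g. a site outage or CAPTCHA) from the process score. All names and the scoring rule below are illustrative assumptions, not the released implementation; the actual Universal Verifier is model-based and far richer than this heuristic.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-step judgment, for illustration only. The paper's
# verifier produces judgments like these with a model plus rubrics;
# here they are supplied by hand.
@dataclass
class StepJudgment:
    followed_correct_step: bool
    failure_controllable: bool = True  # False: site outage, CAPTCHA, etc.

def process_reward(steps: List[StepJudgment]) -> float:
    """Fraction of steps judged correct, skipping steps that failed for
    uncontrollable reasons so the agent is not penalized for them."""
    scored = [s for s in steps if s.followed_correct_step or s.failure_controllable]
    if not scored:
        return 0.0
    return sum(s.followed_correct_step for s in scored) / len(scored)

def outcome_reward(final_state_matches_goal: bool) -> float:
    """Outcome signal judged only from the final state, independent of
    the path taken to reach it."""
    return 1.0 if final_state_matches_goal else 0.0
```

In this toy scheme, a trajectory that follows the right steps but is blocked by an uncontrollable failure still earns a high process reward despite a zero outcome reward, while a trajectory that reaches the goal by an unexpected path earns a full outcome reward despite a low process reward, which is exactly the complementary-signal behavior the abstract describes.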