🤖 AI Summary
Current evaluation methods for computer-use agents (CUAs) rely on static benchmarks or human judgment, neither of which scales to reliable automated assessment across diverse desktop environments. This work presents the first systematic study of vision-language models (VLMs) as autonomous auditors that judge task completion from a natural-language instruction and the final interface state. We conduct large-scale meta-evaluations across three major CUA benchmarks on macOS, Windows, and Linux. Our experiments show that while state-of-the-art VLMs achieve high accuracy and well-calibrated confidence on standard tasks, their performance degrades significantly in complex or heterogeneous environments. Moreover, substantial disagreement persists among top-performing models, highlighting fundamental limitations of current VLM-based auditing.
📝 Abstract
Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, autonomously executing tasks in desktop environments from high-level natural-language instructions. As such agents grow more capable and are deployed across diverse desktop environments, evaluating their behavior scalably and reliably becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, all of which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors that assess CUA task completion directly from observable interactions, and we conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux, and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors degrade notably in more complex or heterogeneous environments, and even high-performing models disagree significantly in their judgments. These results expose fundamental limitations of current model-based auditing and underscore the need to account explicitly for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
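The three meta-evaluation dimensions named above (accuracy, calibration of confidence estimates, and inter-model agreement) can be sketched with standard metrics. The snippet below is a minimal illustration, not the paper's actual pipeline; the auditor verdicts, confidence scores, and ground-truth labels are hypothetical, and calibration is measured here with a simple expected calibration error (ECE), one common choice among several.

```python
# Hypothetical meta-evaluation metrics for VLM auditors:
# accuracy vs. ground truth, expected calibration error (ECE),
# and pairwise inter-auditor agreement. All data below is invented.

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def expected_calibration_error(confs, preds, labels, n_bins=5):
    # Bin verdicts by confidence; ECE is the weighted mean gap
    # between each bin's average confidence and its accuracy.
    bins = [[] for _ in range(n_bins)]
    for c, p, y in zip(confs, preds, labels):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, p == y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        bin_acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(labels)) * abs(avg_conf - bin_acc)
    return ece

def pairwise_agreement(preds_a, preds_b):
    # Fraction of tasks on which two auditors return the same verdict.
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

labels = [1, 1, 0, 0, 1, 0]                    # ground-truth task success
vlm_a  = [1, 1, 0, 1, 1, 0]                    # auditor A's verdicts
conf_a = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85]      # auditor A's confidence
vlm_b  = [1, 0, 0, 1, 1, 0]                    # auditor B's verdicts

print(round(accuracy(vlm_a, labels), 3))                           # → 0.833
print(round(expected_calibration_error(conf_a, vlm_a, labels), 3)) # → 0.133
print(round(pairwise_agreement(vlm_a, vlm_b), 3))                  # → 0.833
```

A real auditor would additionally send the instruction and a final-state screenshot to a VLM and parse its verdict and confidence; the metrics above then summarize those judgments across a benchmark.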