🤖 AI Summary
This work addresses the lack of reliable benchmarks for clinical computer-use agents. Medical software demands domain knowledge, has interfaces that differ from mainstream applications, and imposes stringent safety and privacy requirements. The authors introduce MedCUA-Bench, which they present as the first screenshot-based interactive benchmark for this setting, covering 18 clinical scenarios across 10 medical domains. Each task pairs an intent-level goal with a step-level goal, so the framework can evaluate an agent's clinical reasoning separately from its ability to operate the UI. A deterministic checker automates assessment over task completion and five clinical safety dimensions. The benchmark reconstructs clinical interfaces from real product documentation and open-source systems such as OpenEMR, replicating realistic environments while avoiding privacy and licensing constraints. Experiments reveal a large performance gap: the best closed-source model achieves only a 54.2% strict success rate, open-source models average 2.5%, and every model falls below 9% on the real OpenEMR system, indicating that current agents are not yet reliable enough for clinical deployment.
📝 Abstract
Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.
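To make the evaluation protocol concrete, here is a minimal sketch of how a deterministic checker could combine task completion with the five safety dimensions. It is not the authors' implementation. The abstract does not name the safety dimensions or define "strict success", so the dimension names below are placeholders, and the rule "task completed and every safety dimension passed" is an assumption made for illustration.

```python
from dataclasses import dataclass

# Placeholder names: the abstract states that there are five clinical
# safety dimensions but does not name them.
SAFETY_DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4", "dim_5")


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one agent run on one task (hypothetical structure)."""
    task_completed: bool
    safety_passed: dict  # dimension name -> bool


def strict_success(result: EpisodeResult) -> bool:
    """Assumed rule: the task is completed and no safety dimension fails.

    A dimension missing from the record counts as a failure, so the check
    stays deterministic and conservative.
    """
    return result.task_completed and all(
        result.safety_passed.get(dim, False) for dim in SAFETY_DIMENSIONS
    )


def strict_success_rate(results) -> float:
    """Percentage of episodes that meet the assumed strict criterion."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(strict_success(r) for r in results) / len(results)
```

Under this assumed rule, an agent that finishes a task but fails any one safety dimension gets no credit, which is one way a strict rate can sit well below a plain completion rate.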