HLL: Can Agents Cross Humanity's Last Line of Verification?

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether multimodal agents can genuinely replace humans in protected workflows, using CAPTCHA as a boundary test of human-level capability. The authors propose HLL, an interactive benchmark that reframes CAPTCHA solving as a test of human-replacement competence. Its evaluation framework combines interface perturbations, task variations, and trajectory validation, and it emphasizes procedural plausibility over outcome correctness alone. State-of-the-art agents are stress-tested in a closed-loop GUI environment that exercises multimodal perception, action planning, and trajectory verification together. The experiments show that current agents remain fragile at this boundary: performance is highly sensitive to CAPTCHA type, degrades significantly under realistic interface settings, and declines further when agents must produce valid interaction trajectories.

📝 Abstract

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL
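
To make the idea of trace-conditioned validation concrete, here is a minimal Python sketch of how an outcome-only success rate could differ from a trace-conditioned one. It is not taken from the HLL codebase: the names `Action`, `Episode`, `trace_is_valid`, and `success_rates` are hypothetical, and the validity rules (non-empty trace, in-bounds coordinates, non-decreasing timestamps) are illustrative assumptions standing in for whatever checks the benchmark actually applies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Action:
    """One GUI action in a solving trace (hypothetical schema)."""
    kind: str   # e.g. "click", "drag", "type"
    x: float    # screen x-coordinate in pixels
    y: float    # screen y-coordinate in pixels
    t: float    # seconds since the episode started


@dataclass
class Episode:
    """One CAPTCHA attempt: final outcome plus the recorded action trace."""
    answer_correct: bool
    trace: List[Action]


def trace_is_valid(trace: List[Action], width: float, height: float) -> bool:
    """Assumed validity rules: non-empty, in-bounds, time-ordered."""
    if not trace:
        return False
    in_bounds = all(0 <= a.x <= width and 0 <= a.y <= height for a in trace)
    time_ordered = all(b.t >= a.t for a, b in zip(trace, trace[1:]))
    return in_bounds and time_ordered


def success_rates(
    episodes: List[Episode], width: float = 1280, height: float = 720
) -> Tuple[float, float]:
    """Return (outcome-only rate, trace-conditioned rate)."""
    n = len(episodes)
    if n == 0:
        return 0.0, 0.0
    outcome = sum(e.answer_correct for e in episodes)
    conditioned = sum(
        e.answer_correct and trace_is_valid(e.trace, width, height)
        for e in episodes
    )
    return outcome / n, conditioned / n


episodes = [
    Episode(True, [Action("click", 100, 200, 0.5)]),    # correct, valid trace
    Episode(True, [Action("click", 5000, 200, 0.5)]),   # correct, out of bounds
    Episode(False, [Action("click", 100, 200, 0.5)]),   # wrong answer
    Episode(True, []),                                  # correct, no trace
]
outcome_rate, conditioned_rate = success_rates(episodes)
```

Under these assumed rules, three of the four toy episodes count as outcome successes but only one survives the trace check, which mirrors the kind of drop the abstract describes when correct answers must be supported by valid action traces.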

Problem

Research questions and friction points this paper is trying to address.

CAPTCHA

human verification

multimodal agents

human substitution

verification boundary

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agents

CAPTCHA verification

human substitution