AI Summary
Current AI systems, despite strong performance on general benchmarks, lack effective evaluation mechanisms for high-value, long-horizon tasks in real-world professional settings. To address this gap, this work introduces a longitudinal evaluation benchmark focused on non-manual occupational tasks, spanning 13 industries, 55 subdomains, and over 1,000 expert-validated tasks developed collaboratively by more than 250 specialists. It uniquely integrates the O*NET/SOC 2018 occupational taxonomy with AI agent assessment and proposes a GDP-impact-driven, dynamically evolving benchmark framework. Emphasizing task authenticity, economic relevance, and verifiable outcomes, the benchmark supports continuous evaluation of mainstream agent architectures. Experimental results reveal that state-of-the-art agents achieve an average full-pass rate of only 2.6% on the most challenging tasks, demonstrating the benchmark's strong discriminative power and its utility as a quantitative tool to bridge the gap between AI capabilities and real-world economic impact.
Abstract
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.