Agents' Last Exam

📅 2026-06-03
📈 Citations: 0
✨ Influential: 0

career value

157K/year
🤖 AI Summary
Current AI systems, despite strong performance on general benchmarks, lack effective evaluation mechanisms for high-value, long-horizon tasks in real-world professional settings. To address this gap, this work introduces a longitudinal evaluation benchmark focused on non-manual occupational tasks, spanning 13 industries, 55 subdomains, and over 1,000 expert-validated tasks developed collaboratively by more than 250 specialists. It uniquely integrates the O*NET/SOC 2018 occupational taxonomy with AI agent assessment and proposes a GDP-impact-driven, dynamically evolving benchmark framework. Emphasizing task authenticity, economic relevance, and verifiable outcomes, the benchmark supports continuous evaluation of mainstream agent architectures. Experimental results reveal that state-of-the-art agents achieve an average full-pass rate of only 2.6% on the most challenging tasks, demonstrating the benchmark's strong discriminative power and its utility as a quantitative tool to bridge the gap between AI capabilities and real-world economic impact.
📝 Abstract
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
Problem

Research questions and friction points this paper is trying to address.

AI evaluation
economically valuable tasks
real-world workflows
benchmark gap
professional domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI agents
benchmarking
real-world tasks
economic impact
long-horizon evaluation
🔎 Similar Papers
Yiyou Sun
University of California, Berkeley
Machine Learning
Xinyang Han
Southern University of Science and Technology
Robot control, Embedded system
Weichen Zhang
PhD, University of Sydney
Computer Vision, Deep Learning, Transfer Learning, Domain Adaptation
Yuanbo Pang
University of California, Berkeley
Tianyu Wang
University of California, Berkeley
Yuhan Cao
University of California, Berkeley
Yixiao Huang
Tsinghua University
Operations research, vehicle routing problem, city logistics
Chris Duroiu
University of California, Berkeley
Haoyun Zhang
University of California, Berkeley
Jeffrey Lin
Federal Reserve Bank of Philadelphia
Urban economics, Regional economics, Economic growth
Weishu Zhang
University of California, Berkeley
Tyler Zeng
University of California, Berkeley
Ying Yan
Microsoft Research
Big Data Management
Bo Liu
University of California, Berkeley
Hanson Wen
University of California, Berkeley
Mingyang Xu
University of California, Berkeley
Xiaoyuan Liu
UC Berkeley
Security, System, NLP
Zimeng Chen
University of California, Berkeley
Weiyan Shi
University of California, Berkeley
Amanda Dsouza
University of California, Berkeley
Vincent Sunn Chen
University of California, Berkeley
Patrick Bryant
University of California, Berkeley
Carl Boettiger
UC Berkeley
Ecology, Conservation Biology, Stochastic Systems, Tipping Points, Dynamical Systems
Yamini Rangan
University of California, Berkeley
Bradley Rothenberg
University of California, Berkeley