RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 24 · Influential: 1
🤖 AI Summary
Existing evaluations of AI R&D capabilities lack realistic benchmarks with direct human comparison. Method: We introduce RE-Bench, an open benchmark for ML research engineering comprising seven challenging, open-ended environments and 71 eight-hour attempts by 61 distinct human experts. The evaluation balances ecological validity with comparability, pitting several public frontier models against human experts via best-of-k sampling under varying time budgets and agent scaffolds. Contribution/Results: RE-Bench reveals a performance crossover: top AI agents score 4× higher than humans under 2-hour budgets, yet humans achieve 2× the top agent's score given 32 total hours. Notably, one agent wrote a custom Triton kernel faster than any human expert's implementation. All benchmark environments, human expert data, analysis code, and agent trajectories are released to support reproducible, human-comparable AI evaluation.

📝 Abstract
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g. an agent wrote a faster custom Triton kernel than any of our human experts' -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.
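The best-of-k comparison in the abstract can be made concrete: for a fixed total time budget, split it into k shorter attempts and keep the maximum score achieved. Below is a minimal sketch of that aggregation, assuming normalized per-run scores and a simple resample-and-max estimator; the function name, example data, and resampling scheme are illustrative, not RE-Bench's actual analysis code:

import random
from statistics import mean

def best_of_k_score(run_scores, k, n_resamples=1000):
    # Estimate the expected best-of-k score: draw k runs (with
    # replacement) from the observed per-run scores, keep the max,
    # and average that max over many resamples.
    estimates = [max(random.choices(run_scores, k=k))
                 for _ in range(n_resamples)]
    return mean(estimates)

# Hypothetical example: an 8-hour budget spent as four independent
# 2-hour attempts in one environment (scores are made up).
two_hour_scores = [0.0, 0.1, 0.2, 0.35, 0.6]
print(round(best_of_k_score(two_hour_scores, k=4), 3))

Under best-of-k, many cheap attempts reward high-variance strategies, which is consistent with the paper's finding that short time budgets favor AI agents.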
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI R&D capabilities against human experts
Assessing AI agent performance in realistic ML research tasks
Comparing human and AI efficiency in solving engineering challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

RE-Bench directly compares AI agent and human expert R&D performance
AI agents generate and test solutions over 10x faster than humans, at much lower cost
Open-sourced environments, human expert data, analysis code, and agent trajectories
👥 Authors
Hjalmar Wijk · Model Evaluation and Threat Research (METR)
Tao Lin · METR
Joel Becker · Member of Technical Staff, METR; CEO, Qally's
Sami Jawhar · METR
Neev Parikh · METR; Stripe; Brown University
Thomas Broadley · METR
Lawrence Chan · METR
Michael Chen · Undergraduate, Carnegie Mellon University
Josh Clymer · METR
Jai Dhyani · METR
Elena Ericheva · METR
Katharyn Garcia · METR
Brian Goodrich · METR
Nikola Jurkovic · METR
Megan Kinniment · METR
Aron Lajko · METR
Seraphina Nix · METR
L. Sato · METR
William Saunders · OpenAI
Maksym Taran · METR
Ben West · METR
Elizabeth Barnes · METR