AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

📅 2026-06-03
🤖 AI Summary
Current evaluations of large language models are largely confined to single-turn or short-horizon tasks and do not assess long-horizon scientific research and engineering optimization. This work proposes AutoLab, a benchmark for ultra-long-horizon closed-loop optimization comprising 36 expert-curated tasks across four domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task starts from a correct but deliberately suboptimal baseline that the agent must improve under a strict wall-clock budget, which puts the emphasis on time awareness and iterative refinement. The benchmark, evaluation harness, and task artifacts are open-sourced. Evaluation across 17 state-of-the-art models shows that persistence in benchmarking, editing, and incorporating empirical feedback, rather than the quality of the initial attempt, is the dominant predictor of success; claude-opus-4.6 shows strong long-horizon optimization capability, while most other models either terminate prematurely or exhaust their budgets with minimal progress.
📝 Abstract
Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
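To make the closed-loop setting in the abstract concrete, here is a minimal sketch of a time-budgeted improve-and-measure loop. It is not the AutoLab harness: the function `run_closed_loop`, its parameters, and the toy objective are all hypothetical, chosen only to illustrate starting from a correct but suboptimal baseline and refining it until a wall-clock budget runs out.

```python
import random
import time


def run_closed_loop(baseline, propose, measure, is_correct, budget_s):
    """Propose, check, measure, and keep the best artifact until the budget ends.

    All names here are illustrative assumptions, not part of AutoLab.
    Lower scores are treated as better.
    """
    deadline = time.monotonic() + budget_s
    best, best_score = baseline, measure(baseline)
    history = [best_score]  # empirical feedback visible to later proposals
    while time.monotonic() < deadline:
        candidate = propose(best, history)
        if not is_correct(candidate):
            continue  # reject edits that break correctness
        score = measure(candidate)
        history.append(score)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score, history


# Toy task (assumption): minimize (x - 3)^2 starting from a poor baseline x = 10.
rng = random.Random(0)
best, best_score, history = run_closed_loop(
    baseline=10.0,
    propose=lambda x, hist: x + rng.gauss(0.0, 1.0),  # small random edit
    measure=lambda x: (x - 3.0) ** 2,
    is_correct=lambda x: -100.0 <= x <= 100.0,
    budget_s=0.05,
)
```

In this sketch the loop exits only when the deadline passes, so an agent that stops early simply leaves budget unused; that mirrors the premature-termination failure mode the abstract describes, under the stated assumptions.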
Problem

Research questions and friction points this paper is trying to address.

long-horizon
iterative improvement
autonomous agents
closed-loop optimization
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon optimization
closed-loop iteration
autonomous agents
empirical feedback
benchmarking
🔎 Similar Papers
Zhangchen Xu
University of Washington
(^._.^)ノ, Synthetic Data, Post-Training, Safety, Federated Learning

Junda Chen
UCSD

Yue Huang
PhD student, University of Notre Dame
trustworthy AI, generative model, machine learning, AI for science

Dongfu Jiang
University of Waterloo
Large Language Model, Multimodality Reasoning, Evaluation

Jiefeng Chen
Google

Hang Hua
University of Rochester
Computer Vision, Natural Language Processing, Machine Learning

Zijian Wu
PhD Student, School of Computing, National University of Singapore
Large Language Models, Vision-Language Models, Data-Centric AI

Zheyuan Liu
University of Notre Dame
Large Language Model, AI Privacy, AI Fairness, AI Security, Trustworthy AI

Zexue He
University of California, San Diego
Trustworthy NLP, LLM

Lichi Li
Cisco Systems, Inc.
Large Language Models, Multimodality, Probabilistic Models, Recommender Systems

Shizhe Diao
NVIDIA Research
Large Language Models, Natural Language Processing

Jiaxin Pei
Stanford University, The University of Texas at Austin
Human-Centered AI, NLP, Human-Computer Interaction, Computational Social Science

Jinsung Yoon
Research Scientist at Google Cloud AI
Machine Learning, Deep Learning

Hao Zhang
UC San Diego
Machine Learning, Systems, Computer Vision

Mengdi Wang
Professor, Princeton AI Lab, CSML&ECE, Princeton University
LLM, AI for science, agent, data science, control

Radha Poovendran
Professor of ECE, University of Washington
Security, Games, Learning, Networks, CPS

Misha Sra
UCSB
Spatial Human-AI Interaction, XR, Haptics

Alex Pentland
MIT

Zichen Chen
UC Santa Barbara
Agentic LLM, Trustworthy AI, AI Safety, Synthetic Data