Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

📅 2026-06-05
🤖 AI Summary
This study examines what happens when frontier AI models perform complex reasoning without explicit chain-of-thought (CoT), a capability that could undermine safety measures that rely on monitoring CoT. The authors evaluate no-CoT reasoning on over 30,000 questions spanning 43 benchmarks, in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. They introduce two metrics: the 50% task-completion time horizon (TH), the human time required for tasks a model completes with a 50% success rate, and a 50% reasoning token horizon, the minimum number of o3-mini reasoning tokens needed for tasks a model solves with a 50% success rate. The no-CoT 50% TH of frontier models has doubled roughly every year over the past six years, with GPT-5.5 exceeding 3 minutes and a reasoning token horizon above 1,500 tokens. Median estimates suggest frontier no-CoT THs could exceed 7 minutes by 2028 and 25 minutes by 2030, though these projections carry substantial uncertainty.
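As a rough consistency check on the projections above, the sketch below computes the doubling time implied by growth from 7 minutes in 2028 to 25 minutes in 2030, assuming pure exponential growth between those two points. The helper name `implied_doubling_time` is hypothetical, and the two figures are quoted lower bounds rather than point estimates, so the result (about 1.09 years) is only indicative.

```python
import math


def implied_doubling_time(h_start: float, h_end: float, years: float) -> float:
    """Doubling time (in years) of a quantity that grows exponentially
    from h_start to h_end over the given number of years."""
    return years * math.log(2) / math.log(h_end / h_start)


# Projected horizons quoted in the summary: 7 minutes by 2028, 25 minutes by 2030.
doubling_years = implied_doubling_time(7.0, 25.0, 2030 - 2028)
print(f"implied doubling time: {doubling_years:.2f} years")
```

The result is in line with the reported trend of the horizon doubling roughly every year.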
📝 Abstract
Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.
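To make the 50% time horizon concrete, here is a minimal sketch of one common way to estimate such a quantity: fit a logistic curve of success probability against the logarithm of human completion time, then solve for the time at which predicted success is 50%. This is an assumption for illustration, not a confirmed description of the authors' procedure. The function `fit_horizon` and all data are hypothetical; the data are synthetic, generated from a known 3-minute horizon so that the fit can be checked.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_horizon(minutes, success_rates, n_tasks, iters=25):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by Newton's method
    and return the human time (in minutes) at which P(success) = 0.5."""
    xs = [math.log2(t) for t in minutes]
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y, n in zip(xs, success_rates, n_tasks):
            p = sigmoid(a + b * x)
            w = n * p * (1.0 - p)
            # Gradient of the binomial log-likelihood.
            g_a += n * (y - p)
            g_b += n * (y - p) * x
            # Entries of the (negated) Hessian.
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    # Success is 50% where a + b * log2(t) = 0.
    return 2.0 ** (-a / b)


# Synthetic data (NOT from the paper): success rates generated from a
# logistic curve whose true 50% horizon is 3 minutes.
TRUE_HORIZON = 3.0
minutes = [0.25, 0.5, 1, 2, 4, 8, 16]
rates = [sigmoid(-(math.log2(t) - math.log2(TRUE_HORIZON))) for t in minutes]
counts = [100] * len(minutes)

horizon = fit_horizon(minutes, rates, counts)
print(f"estimated 50% time horizon: {horizon:.2f} minutes")
```

Working in log-time means each doubling of task length shifts the success odds by a constant factor. The same fit could in principle be applied with reasoning-token counts in place of human minutes to obtain a token horizon.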
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
AI safety
reasoning
task-completion time
frontier AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

no-CoT reasoning
task-completion time horizon
reasoning token horizon
frontier AI models
AI safety monitoring
Dewi Gould
Redwood Research, Astra Fellows Program
Francis Rhys Ward
Imperial College London
AI alignment, deception, safety evaluations
Anders Cairns Woodruff
Redwood Research, Astra Fellows Program
Rauno Arike
Aether Research
Josh Hills
Astra Fellows Program
Alex Serrano
Undergraduate, Polytechnic University of Catalonia
Artificial Intelligence, Machine Learning, Large Language Models, AI Safety
Ida Caspary
Astra Fellows Program, Imperial College London
Jason Ross Brown
Astra Fellows Program, University of Cambridge
Jo J. Jiao
MATS Research, University of Chicago
Patrick Leask
Durham University
Artificial Intelligence
Twm Stone
MATS Research
Ram Potham
Redwood Research, Astra Fellows Program
Ionut Gabriel Stan
MIT
Harry Mayne
Astra Fellows Program, University of Oxford
Simeon Hellsten
University of Glasgow
Shubhorup Biswas
Aether Research
Ariana Azarbal
MATS Research
William L. Anderson
University of Texas at Austin (retired)
information systems, digital libraries, scientific data informatics, CODATA
Elle Najt
Constellation
Ryan Greenblatt
Redwood Research
Julian Stastny
Center on Long-Term Risk
Machine Learning, Cooperative AI, Game Theory, Reinforcement Learning