Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how AI agents can autonomously advance cumulative scientific research over long horizons. The authors propose Arbor, a framework that organizes hypotheses into a persistent, evolving tree so that hypotheses, evidence, and insights are carried across iterations. Its architecture combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR). Key technical elements include execution in isolated worktrees, global strategy coordination over the tree, distillation of reusable insights, and refinement of the search frontier. Evaluated on six real research tasks, Arbor achieves the best held-out result on all six, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, it reaches 86.36% Any Medal with GPT-5.5.

📝 Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
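The coordinator/executor loop over a persistent hypothesis tree can be illustrated with a minimal sketch. This is not Arbor's implementation: the names `HypothesisNode`, `frontier`, `run_round`, and the `executor` callback are hypothetical, the "frontier" is assumed to be the set of untested nodes, and "admitting a verified improvement" is assumed to mean accepting a result only when its held-out score beats the current best.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class HypothesisNode:
    """One tree node: a hypothesis plus what was learned from testing it."""
    hypothesis: str
    parent: Optional["HypothesisNode"] = None
    children: List["HypothesisNode"] = field(default_factory=list)
    score: Optional[float] = None   # evidence (held-out metric); None until tested
    insight: Optional[str] = None   # distilled lesson from the experiment

    def add_child(self, hypothesis: str) -> "HypothesisNode":
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> List[str]:
        """Lessons along the path from the root down to this node."""
        lessons: List[str] = []
        node: Optional[HypothesisNode] = self
        while node is not None:
            if node.insight:
                lessons.append(node.insight)
            node = node.parent
        return lessons[::-1]


def frontier(root: HypothesisNode) -> List[HypothesisNode]:
    """Untested nodes, in depth-first order (assumed definition of the frontier)."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        if node.score is None:
            out.append(node)
        stack.extend(reversed(node.children))
    return out


# Hypothetical executor signature: (hypothesis, inherited lessons) -> (score, insight).
Executor = Callable[[str, List[str]], Tuple[float, str]]


def run_round(root: HypothesisNode, best_score: float, executor: Executor) -> float:
    """Test every frontier node, record evidence, and admit only improvements."""
    for node in frontier(root):
        node.score, node.insight = executor(node.hypothesis, node.inherited_insights())
        if node.score > best_score:   # admit only a verified improvement
            best_score = node.score
    return best_score


# Toy usage with made-up scores standing in for real experiments.
root = HypothesisNode("baseline", score=0.50, insight="baseline overfits")
root.add_child("add dropout")
root.add_child("raise learning rate")
toy_scores = {"add dropout": 0.62, "raise learning rate": 0.41}
best = run_round(root, 0.50, lambda h, lessons: (toy_scores[h], f"tried: {h}"))
print(best)
```

In this sketch the tree is the only state that persists between rounds, so a later executor sees the lessons of its ancestors through `inherited_insights` rather than starting from scratch. A real system would also need isolated workspaces per executor and a policy for expanding or pruning the frontier, both of which are omitted here.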

Problem

Research questions and friction points this paper is trying to address.

autonomous research

hypothesis refinement

long-horizon AI

scientific discovery

iterative experimentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypothesis Tree Refinement

Autonomous Research

Arbor Framework