Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models and AI agents still lack domain sensitivity, ethical awareness, and fine-grained scientific reasoning, which makes them inadequate substitutes for human researchers in scientific tasks. To address this gap, this work proposes the AARR (Act As a Real Researcher) benchmark suite, the first evaluation framework grounded in the behavioral characteristics of human researchers. Where existing benchmarks focus on macro-level task execution, AARR assesses fine-grained competencies across the entire research lifecycle. The inaugural sub-benchmark, AARRI-Bench, evaluates state-of-the-art models (e.g., Claude Opus 4.7) integrated with agent frameworks (e.g., Mini-SWE-Agent). Even the best-performing configuration achieves only a 68.3% success rate and frequently overlooks critical details, which indicates that human-like scientific intelligence requires deeper modeling of research behavior rather than increasingly complex scaffolding.

📝 Abstract

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only a 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely more complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.

Problem

Research questions and friction points this paper is trying to address.

research agent

scientific reasoning

LLM benchmark

research ethics

field sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

research agent

LLM benchmark

scientific reasoning