🤖 AI Summary
Existing benchmarks primarily evaluate AI systems' retrieval and reporting capabilities, overlooking their potential to generate novel scientific insights at the research frontier. This paper introduces ResearcherBench, the first deep research benchmark targeting frontier AI scientific questions, comprising 65 expert-level research questions spanning 35 AI subjects. It proposes a dual evaluation framework: (1) expert-guided rubric assessment of insight quality, and (2) factual assessment combining citation faithfulness and literature groundedness. The benchmark covers three question types: technical details, literature review, and open consulting. Empirical results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, particularly on open-ended consulting questions. ResearcherBench is publicly released to advance AI-augmented scientific discovery and foster next-generation research paradigms.
📝 Abstract
The emergence of deep research systems demonstrates significant problem-solving capabilities, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced agentic systems, which we refer to as Deep AI Research Systems (DARS), on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.
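To make the factual assessment concrete, the sketch below illustrates how the two ratios described in the abstract, citation faithfulness (accuracy of individual citations) and groundedness (coverage of key claims), could be computed once per-statement support judgments are available, e.g. from an LLM judge or human annotator. This is a minimal illustrative sketch under those assumptions, not the paper's implementation; the class and function names (`CitedClaim`, `faithfulness`, `groundedness`) are hypothetical and do not come from the ResearcherBench codebase.

```python
# Hypothetical sketch of the two factual metrics (faithfulness, groundedness).
# The support judgments are assumed to be produced elsewhere (LLM judge / expert)
# and are abstracted here as boolean fields and pre-filtered lists.

from dataclasses import dataclass
from typing import List


@dataclass
class CitedClaim:
    claim: str        # statement made in the generated research report
    source_url: str   # citation attached to the statement
    supported: bool   # whether the cited source actually supports the claim


def faithfulness(citations: List[CitedClaim]) -> float:
    """Citation accuracy: fraction of cited statements whose source supports them."""
    if not citations:
        return 0.0
    return sum(c.supported for c in citations) / len(citations)


def groundedness(key_claims: List[str], supported_claims: List[str]) -> float:
    """Coverage: fraction of the report's key claims backed by cited literature."""
    if not key_claims:
        return 0.0
    backed = sum(claim in supported_claims for claim in key_claims)
    return backed / len(key_claims)
```

A higher faithfulness score penalizes fabricated or misattributed citations, while a higher groundedness score rewards reports whose substantive claims are actually anchored in the retrieved literature; reporting both prevents a system from gaming one at the expense of the other.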