A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately evaluate Deep Research Agents (DRAs) on complex, open-ended tasks due to narrow evaluation dimensions, mismatched output formats, and coarse-grained scoring mechanisms. Method: We introduce DRA-Bench—the first rigorous, DRA-specific benchmark—comprising 214 expert-crafted questions spanning ten thematic domains, explicitly designed for long-form, report-style outputs. We propose the first multidimensional automated evaluation framework, quantifying performance along three axes: semantic quality, topical focus, and retrieval credibility. This framework integrates human-curated reference packages with cross-source retrieval and multi-stage reasoning for end-to-end fine-grained assessment. Contribution/Results: Experiments reveal that state-of-the-art DRAs significantly outperform retrieval-augmented reasoning models; however, they exhibit notable deficiencies in information integration and logical consistency. DRA-Bench establishes a reliable, diagnostic foundation for evaluating and advancing DRA capabilities.

📝 Abstract
Artificial intelligence is undergoing a paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex, open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack multidimensional evaluation for deep research agents
Current scoring mechanisms cannot effectively assess report-style responses
There is no rigorous framework for evaluating complex agent capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multidimensional evaluation framework for deep research agents
Benchmark with expert-curated queries across diverse domains
Integrated scoring metrics for semantic quality and trustworthiness
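To make the three-axis scoring idea concrete, here is a minimal Python sketch of how per-report scores on semantic quality, topical focus, and retrieval trustworthiness might be combined into a single composite. The `AxisScores` type, the `composite_score` function, and the weights are all illustrative assumptions; the paper describes the three axes but the summary above does not specify how (or whether) they are aggregated into one number.

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    """Per-report scores on the three evaluation axes, each in [0, 1]."""
    semantic_quality: float   # coherence and quality of the long-form report
    topical_focus: float      # how well the report stays on the query's topic
    retrieval_trust: float    # credibility of the retrieved sources

def composite_score(s: AxisScores,
                    weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted average of the three axis scores.

    The weights are illustrative placeholders, normalized so the result
    stays in [0, 1] regardless of the weight values passed in.
    """
    w_sem, w_top, w_ret = weights
    total = w_sem + w_top + w_ret
    return (w_sem * s.semantic_quality
            + w_top * s.topical_focus
            + w_ret * s.retrieval_trust) / total

# Example: a report strong on focus but weaker on sourcing.
report = AxisScores(semantic_quality=0.8, topical_focus=0.9, retrieval_trust=0.6)
print(round(composite_score(report), 3))  # → 0.77
```

A weighted average is only one possible aggregation; a diagnostic benchmark like this one would typically also report the per-axis scores separately, since a single composite hides exactly the kind of deficiency (e.g. weak retrieval credibility) the framework is designed to expose.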
Yang Yao
Shanghai Artificial Intelligence Laboratory, The University of Hong Kong
Yixu Wang
Shanghai Artificial Intelligence Laboratory, Fudan University
Yuxuan Zhang
University of British Columbia
Yi Lu
University of Toronto
Tianle Gu
Tsinghua University
(M)LLM Safety, PEFT
Lingyu Li
Shanghai Jiao Tong University
Active Inference, Artificial Intelligence, Philosophy
Dingyi Zhao
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Keming Wu
Ph.D. Student, Tsinghua University
Computer Vision, Vision Language Models, Generative AI
Haozhe Wang
Hong Kong University of Science and Technology
Ping Nie
University of Waterloo
Natural Language Processing, Information Retrieval, Recommendation Systems, Time Series Forecasting
Yan Teng
Shanghai Artificial Intelligence Laboratory
Yingchun Wang
Shanghai Artificial Intelligence Laboratory