DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

📅 2026-06-10
🤖 AI Summary
Existing evaluation benchmarks for search agents predominantly focus on specialized tasks and rely on coarse-grained scoring, which inadequately captures real-world user needs. This work proposes an open-ended evaluation benchmark tailored to everyday search tasks. Its framework decomposes tasks into subtasks, disentangles performance across multiple dimensions, and combines cascaded rubric scoring with aggregated user preferences to enable fine-grained, interpretable agent assessment. Evaluation of 17 agentic systems under this framework reveals substantial gaps in their ability to meet everyday search expectations. The study also releases an open-source dataset comprising 150 tasks and 3,546 scoring rubrics, along with the accompanying codebase.
📝 Abstract
Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits the interpretability of the evaluation. To bridge this gap, we introduce DailyReport, an open-ended benchmark for evaluating SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.
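The abstract does not give the scoring formulas, so the following is only a minimal sketch of one possible reading of "cascade rubrics" and "user-centric aggregation". It assumes that a rubric can name a prerequisite rubric, that a rubric whose prerequisite failed is left out of its own dimension's score (so the failure is attributed upstream), and that the user preference score is a weighted mean of dimension scores. The `Rubric` class, the `requires` field, the dimension names, and the weights are all hypothetical and are not taken from the DailyReport codebase.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rubric:
    name: str                       # hypothetical rubric identifier
    dimension: str                  # hypothetical dimension label
    passed: bool                    # verdict from a judge (assumed given)
    requires: Optional[str] = None  # prerequisite rubric name, if any


def dimension_scores(rubrics):
    """Pass rate per dimension, skipping rubrics blocked by a failed prerequisite.

    Assumes rubrics are listed so that prerequisites come first.
    """
    verdict = {}
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rubrics:
        if r.requires is not None and not verdict.get(r.requires, False):
            # Blocked: the upstream failure already explains this miss,
            # so it is not charged to this rubric's dimension.
            verdict[r.name] = False
            continue
        verdict[r.name] = r.passed
        totals[r.dimension] += 1
        hits[r.dimension] += int(r.passed)
    return {d: hits[d] / totals[d] for d in totals}


def preference_score(scores, weights):
    """Weighted mean of dimension scores (weights are assumed user preferences)."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total


rubrics = [
    Rubric("found_source", "retrieval", True),
    Rubric("fact_correct", "accuracy", True, requires="found_source"),
    Rubric("found_price", "retrieval", False),
    Rubric("price_correct", "accuracy", True, requires="found_price"),
]
scores = dimension_scores(rubrics)
pref = preference_score(scores, {"retrieval": 1.0, "accuracy": 3.0})
print(scores, pref)
```

In this toy run the missing price lowers only the retrieval score: the dependent accuracy rubric is skipped rather than counted as a second failure, which is the kind of attribution a cascade is meant to make visible.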
Problem

Research questions and friction points this paper is trying to address.

Search Agents
evaluation benchmark
daily search tasks
interpretability
real-world user scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-ended benchmark
search agents
cascade rubrics
user-centric evaluation
interpretable scoring