Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of Deep Research Agents (DRAs) treat report generation as a one-off task, overlooking the iterative nature of human research, which involves multiple rounds of feedback and revision; they therefore fail to assess agent reliability in sustained interactions. This work proposes Mr Dre, a novel evaluation framework that, for the first time, makes multi-round revision a core dimension of DRA assessment. It introduces a unified protocol for long-form report evaluation alongside a human-validated feedback simulation pipeline. Systematic evaluation across five representative DRA families reveals that agents regress on 16–27% of previously covered content and that citation quality degrades during revision; even the best-performing models often corrupt regions outside the feedback's scope and struggle to retain earlier edits. These findings indicate that existing inference-time optimization techniques are insufficient to mitigate this instability.
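
As a rough illustration of what such a regression measurement might look like: a minimal sketch, assuming coverage is tracked as a set of rubric items per draft. The paper's actual scoring units and interface are not given here; `regression_rate` and the item names below are hypothetical.

```python
# Hypothetical sketch: fraction of previously covered rubric items that a
# revised report drops. "Covered items" stand in for whatever content units
# Mr Dre actually scores; this is not the paper's real interface.

def regression_rate(prev_covered: set[str], curr_covered: set[str]) -> float:
    """Fraction of items covered in the previous draft that the revision loses."""
    if not prev_covered:
        return 0.0
    dropped = prev_covered - curr_covered
    return len(dropped) / len(prev_covered)

# Example: 2 of 4 previously covered items disappear after revision -> 0.5
prev = {"method", "dataset", "baseline", "limitations"}
curr = {"method", "dataset", "new-request"}
print(regression_rate(prev, curr))  # 0.5
```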

📝 Abstract
Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports in response to user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16–27% of previously covered content and on citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom, as they continue to disrupt content outside the feedback's scope and fail to preserve earlier edits. We further show that these issues are not easily resolvable through inference-time fixes such as prompt engineering and a dedicated sub-agent for report revision.
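
A minimal sketch of the multi-turn evaluation loop the abstract describes, assuming three illustrative callables that stand in for the feedback simulator, the agent under test, and the report scorer. The names `simulate_feedback`, `revise`, and `score` are assumptions for illustration, not Mr Dre's actual API.

```python
# Hypothetical multi-turn revision evaluation loop. Scoring the full report
# after every turn is what would surface regressions outside the feedback's
# scope, as the abstract reports.
from typing import Callable

def run_revision_eval(
    initial_report: str,
    simulate_feedback: Callable[[str], str],   # human-verified feedback simulator
    revise: Callable[[str, str], str],         # DRA revision step (report, feedback)
    score: Callable[[str], dict[str, float]],  # comprehensiveness/factuality/presentation
    turns: int = 3,
) -> list[dict[str, float]]:
    """Score the report before revision and after each feedback-and-revise turn."""
    report = initial_report
    history = [score(report)]
    for _ in range(turns):
        feedback = simulate_feedback(report)
        report = revise(report, feedback)
        history.append(score(report))
    return history
```

Comparing successive entries of `history` would show whether edits requested at earlier turns survive later revisions, the failure mode the paper highlights.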
Problem

Research questions and friction points this paper is trying to address.

Deep Research Agents
multi-turn report revision
report generation
user feedback
iterative drafting
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn report revision
Deep Research Agents
evaluation benchmark
feedback simulation
content regression