Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

📅 2026-01-22
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limited capacity of existing Deep Research Agents (DRAs) to self-verify and refine their reasoning during inference. We propose DeepVerifier, a plug-and-play, self-evolving paradigm that operates at test time without requiring additional training. Built on an automatically constructed taxonomy of DRA failures (five major categories and thirteen subcategories), DeepVerifier employs rubric-driven verifiers that iteratively correct outputs through feedback-guided refinement. To support further research, we also publicly release DeepVerifier-4K, a high-quality supervised fine-tuning dataset of 4,646 samples. Experiments show that DeepVerifier substantially outperforms baseline methods, achieving relative improvements of 12%-48% in meta-evaluation F1 score and 8%-11% in accuracy on challenging subsets of GAIA and XBench-DeepResearch.

📝 Abstract
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
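The test-time verify-and-refine loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `refine_loop`, `Verdict`, and the toy policy/verifier stand-ins are hypothetical names introduced here, and a real DeepVerifier deployment would plug in the DRA policy model and the rubric-based verifier in their place.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    passed: bool
    feedback: List[str]  # rubric items the current draft failed

def refine_loop(
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Test-time self-evolution: iterate verify -> feedback -> regenerate,
    with no parameter updates to the underlying model."""
    feedback: List[str] = []
    draft = generate(task, feedback)
    for _ in range(max_rounds):
        verdict = verify(task, draft)
        if verdict.passed:
            break
        feedback = verdict.feedback          # rubric-based critique
        draft = generate(task, feedback)     # bootstrap a refined answer
    return draft

# Toy stand-ins: a "policy" that repairs whatever the rubric flags,
# and a "verifier" that checks a single rubric item.
def toy_generate(task: str, feedback: List[str]) -> str:
    draft = "answer"
    if "missing citation" in feedback:
        draft += " [source]"
    return draft

def toy_verify(task: str, draft: str) -> Verdict:
    failed = [] if "[source]" in draft else ["missing citation"]
    return Verdict(passed=not failed, feedback=failed)

print(refine_loop(toy_generate, toy_verify, "q"))  # -> "answer [source]"
```

In this sketch the first draft fails the citation rubric, the verifier's feedback is fed back, and the second draft passes, mirroring the paper's claim that refinement happens purely at inference time.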
Problem

Research questions and friction points this paper is trying to address.

Deep Research Agents
Inference-Time Verification
Self-Evolution
Rubric-Guided Feedback
Agent Self-Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-Time Scaling
Rubric-Guided Verification
Deep Research Agents
Self-Evolution
DeepVerifier