🤖 AI Summary
Current agent evaluation methods suffer from two key limitations: LLM-as-a-Judge overlooks reasoning traces, while Agent-as-a-Judge lacks cross-domain generalization. This paper introduces the first universal, modular Agent-as-a-Judge framework that jointly validates intermediate reasoning steps and final outcomes via task decomposition, reasoning trace tracking, and multi-module collaborative judgment. Inspired by human evaluation logic, the framework integrates agent outputs, contextual information, and domain-agnostic rules to enable automated, interpretable, and cross-domain task completion assessment. Experiments on GAIA and BigCodeBench demonstrate that our framework achieves 4.76% and 10.52% higher agreement with human judgments than the GPT-4o baseline, respectively. It significantly improves evaluation comprehensiveness and accuracy while maintaining modularity and generalizability across diverse domains and tasks.
📝 Abstract
The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another's task completion, are typically designed for narrow, domain-specific settings. To address these gaps, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent's output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o-based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.
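The decompose–validate–aggregate loop the abstract describes can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: the names `decompose`, `trace_check`, and `StepVerdict` are hypothetical, and a real system would use LLM calls rather than string matching for each module.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class StepVerdict:
    sub_task: str
    passed: bool
    rationale: str

def decompose(task: str) -> List[str]:
    # Stand-in for an LLM-driven task decomposer: here, just split on ';'.
    return [s.strip() for s in task.split(";") if s.strip()]

def trace_check(sub_task: str, trace: List[str]) -> StepVerdict:
    # One judging module: was this sub-task addressed in the reasoning trace?
    hit = any(sub_task.lower() in step.lower() for step in trace)
    return StepVerdict(sub_task, hit,
                       "found in trace" if hit else "missing from trace")

def judge(task: str, trace: List[str],
          modules: List[Callable[[str, List[str]], StepVerdict]]
          ) -> Tuple[bool, List[StepVerdict]]:
    # Run every module on every sub-task, then aggregate: the task counts
    # as complete only if all per-step checks pass.
    verdicts = [m(sub, trace) for sub in decompose(task) for m in modules]
    return all(v.passed for v in verdicts), verdicts

ok, verdicts = judge(
    "download the report; extract the table",
    trace=["Download the report from the site", "Parse HTML"],
    modules=[trace_check],
)
print(ok)  # False: 'extract the table' never appears in the trace
```

Because each module returns a per-sub-task verdict with a rationale, the aggregated result stays interpretable: a failed overall verdict points to exactly which step was not validated.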