Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

📅 2026-06-06
🤖 AI Summary
This work addresses a limitation of existing evaluation methods for social agents driven by large language models: they rely on passive observation and can miss behaviors, such as conflict handling, that only appear under specific triggering conditions. The authors propose an active evaluation framework that deploys an evaluator agent inside the environment to elicit situations relevant to the evaluation criteria, producing trajectories from the target agent that can be scored. The evaluator interacts through the environment's native dialogue and action protocol, and scoring is grounded in evidence from both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, the approach improves criteria coverage and agreement with human labels, surfacing behaviors that passive methods can leave unobserved.
📝 Abstract
Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.
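The contrast between passive and situation-generating evaluation can be illustrated with a minimal sketch. It is not the authors' implementation: the `Criterion`, `Turn`, `run_episode`, `coverage`, and `agreement` names, the toy target agent, the trigger utterances, and the "apology" criterion are all assumptions made for illustration. Coverage here is simply the fraction of criteria with at least one evidence turn, and agreement is plain label match rate, which may differ from the metrics used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Criterion:
    name: str     # hypothetical criterion id, e.g. "conflict_handling"
    trigger: str  # hypothetical utterance that makes the criterion observable


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str
    criterion: Optional[str] = None  # criterion this turn is evidence for, if any


# A target agent is modeled as a function from an optional prompt to a reply.
Target = Callable[[Optional[str]], str]


def run_episode(target: Target, criteria: List[Criterion], elicit: bool) -> List[Turn]:
    """Collect one trajectory, either passively or with an in-world evaluator."""
    trajectory: List[Turn] = []
    for c in criteria:
        if elicit:
            # The evaluator speaks through the same dialogue channel as any agent.
            trajectory.append(Turn("evaluator", c.trigger))
            # Simplifying assumption: the reply to a trigger counts as evidence.
            trajectory.append(Turn("target", target(c.trigger), criterion=c.name))
        else:
            # Passive setup: the target acts freely, so no trigger is guaranteed.
            trajectory.append(Turn("target", target(None)))
    return trajectory


def coverage(trajectory: List[Turn], criteria: List[Criterion]) -> float:
    """Fraction of criteria with at least one evidence turn in the trajectory."""
    observed = {t.criterion for t in trajectory if t.criterion is not None}
    wanted = {c.name for c in criteria}
    return len(observed & wanted) / len(wanted) if wanted else 0.0


def agreement(judge: Dict[str, bool], human: Dict[str, bool]) -> float:
    """Share of jointly labeled criteria where judge and human labels match."""
    shared = judge.keys() & human.keys()
    return sum(judge[k] == human[k] for k in shared) / len(shared) if shared else 0.0


def toy_target(prompt: Optional[str]) -> str:
    """Stand-in for an LLM-powered agent (assumption for illustration)."""
    if prompt is None:
        return "I water the plants."
    return f"I hear you. Let's talk it through: {prompt}"


CRITERIA = [
    Criterion("conflict_handling", "I disagree with your plan."),
    Criterion("apology", "You forgot our meeting yesterday."),
]

passive_cov = coverage(run_episode(toy_target, CRITERIA, elicit=False), CRITERIA)
active_cov = coverage(run_episode(toy_target, CRITERIA, elicit=True), CRITERIA)
print(f"passive coverage={passive_cov}, active coverage={active_cov}")
```

In this toy setting the passive run never produces a disagreement, so conflict handling stays unobserved, while the eliciting run yields an evidence turn for every criterion. A real evaluator would also need to judge whether the target's later actions are consistent with its reply, which this sketch leaves out.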
Problem

Research questions and friction points this paper is trying to address.

interactive agents
social evaluation
situation generation
behavior assessment
LLM-powered agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Agent-as-a-Judge
situation-generating evaluation
interactive social agents
in-world evaluator
evidence-grounded evaluation