🤖 AI Summary
Existing LLM agent safety evaluation methods—such as rule-based or general-LLM–based judgment—struggle to detect progressive risks, subtle semantic deviations, and multi-step cumulative hazards, further hindered by ill-defined safety criteria.
Method: We propose MemEval, a memory-augmented reasoning framework that requires no training. It enables fine-grained, stepwise safety assessment of agent behavior via structured semantic feature extraction, chain-of-thought–driven experience memory construction, and context-aware multi-stage retrieval-augmented generation (RAG).
Contribution/Results: We introduce AgentBench, the first comprehensive safety benchmark comprising 2,293 annotated instances spanning 15 risk categories and 29 scenarios, with dual strict/lenient evaluation standards. Across multiple benchmarks, MemEval achieves human-expert–level accuracy—significantly outperforming prior approaches—and establishes a new state-of-the-art for LLM-as-a-judge in LLM agent safety evaluation and deployment.
📝 Abstract
Despite the rapid advancement of LLM-based agents, reliably evaluating their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss hazards that emerge across agents' step-by-step actions, overlook subtle semantic deviations, fail to capture how minor issues compound over multiple steps, and are misled by ill-defined safety or security criteria. To address this evaluation gap, we introduce sys, a universal, training-free, memory-augmented reasoning framework that enables LLM evaluators to emulate human experts. sys constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we develop data, the first benchmark designed to assess how well LLM-based evaluators can detect both safety risks and security threats. data comprises 2,293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of data is its nuanced treatment of ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that sys not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible.
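The memory-and-retrieve loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Experience` schema, the token-overlap scorer, and all names here are hypothetical stand-ins for the framework's LLM-driven feature extraction and semantic retrieval.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One annotated past interaction in the experiential memory."""
    scenario: str   # structured semantic features extracted for the case
    risk: str
    behavior: str
    cot_trace: str  # chain-of-thought reasoning recorded for this case
    verdict: str    # e.g., "safe" / "unsafe"

def feature_tokens(scenario: str, risk: str, behavior: str) -> set[str]:
    """Flatten the structured features into a token set for matching."""
    return set(f"{scenario} {risk} {behavior}".lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(memory: list[Experience], scenario: str, risk: str,
             behavior: str, top_k: int = 1) -> list[Experience]:
    """Multi-stage retrieval: coarse scenario filter, then feature re-rank."""
    # Stage 1: keep experiences whose scenario overlaps the query's scenario.
    scen = set(scenario.lower().split())
    pool = [e for e in memory if scen & set(e.scenario.lower().split())] or memory
    # Stage 2: rank the surviving pool by overall feature similarity.
    q = feature_tokens(scenario, risk, behavior)
    pool.sort(key=lambda e: jaccard(
        feature_tokens(e.scenario, e.risk, e.behavior), q), reverse=True)
    return pool[:top_k]

# Toy memory of annotated past cases (contents are illustrative).
memory = [
    Experience("web shopping", "financial loss",
               "auto-purchase without confirmation", "CoT: ...", "unsafe"),
    Experience("email assistant", "privacy leak",
               "forwarded a private address", "CoT: ...", "unsafe"),
    Experience("web shopping", "none", "compared prices only", "CoT: ...", "safe"),
]

# The retrieved experience's CoT trace would then be injected into the
# LLM judge's prompt to guide its assessment of the new case.
best = retrieve(memory, "web shopping", "financial",
                "bought item without asking")[0]
```

In the actual framework the crude token overlap would be replaced by the LLM's adaptive feature extraction and context-aware semantic retrieval; the sketch only shows the staged retrieve-then-judge shape.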