🤖 AI Summary
Existing LLM agent safety evaluation methods—such as rule-based or general-LLM–based judgment—struggle to detect progressive risks, subtle semantic deviations, and multi-step cumulative hazards, further hindered by ill-defined safety criteria.
Method: We propose MemEval, a memory-augmented reasoning framework that requires no training. It enables fine-grained, stepwise safety assessment of agent behavior via structured semantic feature extraction, chain-of-thought–driven experience memory construction, and context-aware multi-stage retrieval-augmented generation (RAG).
Contribution/Results: We introduce AgentBench, the first comprehensive safety benchmark comprising 2,293 annotated instances spanning 15 risk categories and 29 scenarios, with dual strict/lenient evaluation standards. Across multiple benchmarks, MemEval achieves human-expert–level accuracy—significantly outperforming prior approaches—and establishes a new state-of-the-art for LLM-as-a-judge in LLM agent safety evaluation and deployment.
📝 Abstract
Despite the rapid advancement of LLM-based agents, reliably evaluating their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss hazards that emerge across agents' step-by-step actions, overlook subtle semantic deviations, fail to capture how minor issues compound over multiple steps, and are misled by ill-defined safety or security criteria. To address this evaluation gap, we introduce sys, a universal, training-free, memory-augmented reasoning framework that enables LLM evaluators to emulate human experts. sys constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we develop data, the first benchmark designed to assess how well LLM-based evaluators can detect both safety risks and security threats. data comprises 2,293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of data is its nuanced treatment of ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that sys not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible.
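The memory-and-retrieve loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Experience` schema, the token-overlap scorer, and all names here are hypothetical stand-ins for the framework's LLM-driven feature extraction and semantic retrieval.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One annotated past interaction in the experiential memory."""
    scenario: str   # structured semantic features extracted for the case
    risk: str
    behavior: str
    cot_trace: str  # chain-of-thought reasoning recorded for this case
    verdict: str    # e.g., "safe" / "unsafe"

def feature_tokens(scenario: str, risk: str, behavior: str) -> set[str]:
    """Flatten the structured features into a token set for matching."""
    return set(f"{scenario} {risk} {behavior}".lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(memory: list[Experience], scenario: str, risk: str,
             behavior: str, top_k: int = 1) -> list[Experience]:
    """Multi-stage retrieval: coarse scenario filter, then feature re-rank."""
    # Stage 1: keep experiences whose scenario overlaps the query's scenario.
    scen = set(scenario.lower().split())
    pool = [e for e in memory if scen & set(e.scenario.lower().split())] or memory
    # Stage 2: rank the surviving pool by overall feature similarity.
    q = feature_tokens(scenario, risk, behavior)
    pool.sort(key=lambda e: jaccard(
        feature_tokens(e.scenario, e.risk, e.behavior), q), reverse=True)
    return pool[:top_k]

# Toy memory of annotated past cases (contents are illustrative).
memory = [
    Experience("web shopping", "financial loss",
               "auto-purchase without confirmation", "CoT: ...", "unsafe"),
    Experience("email assistant", "privacy leak",
               "forwarded a private address", "CoT: ...", "unsafe"),
    Experience("web shopping", "none", "compared prices only", "CoT: ...", "safe"),
]

# The retrieved experience's CoT trace would then be injected into the
# LLM judge's prompt to guide its assessment of the new case.
best = retrieve(memory, "web shopping", "financial",
                "bought item without asking")[0]
```

In the actual framework the crude token overlap would be replaced by the LLM's adaptive feature extraction and context-aware semantic retrieval; the sketch only shows the staged retrieve-then-judge shape.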