🤖 AI Summary
Evaluating the safety of large language models (LLMs) is costly and scales poorly. Method: This paper proposes a multi-agent debate framework powered by small language models (SLMs), in which Critic, Defender, and Judge agents engage in three rounds of structured, value-aligned interaction to detect the semantics of jailbreak attacks at fine granularity. To support evaluation, we introduce HAJailBench—the first large-scale, human-annotated jailbreak benchmark—featuring expert-level, fine-grained annotations for assessing both safety judgments and judge reliability. Contribution/Results: Experiments show the framework achieves judgment consistency comparable to GPT-4o on HAJailBench while reducing inference cost by over 90%, substantially improving evaluation efficiency, accuracy, and scalability. This work provides the first empirical validation of lightweight multi-agent debate as an effective and robust approach to LLM safety assessment.
📝 Abstract
Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.