🤖 AI Summary
Evaluating the safety of large language models (LLMs) is costly and scales poorly. Method: This paper proposes a multi-agent debate framework powered by small language models (SLMs), in which Critic, Defender, and Judge agents engage in three rounds of structured, value-aligned interaction to detect the semantics of jailbreak attacks at fine granularity. To support evaluation, we introduce HAJailBench—the first large-scale, human-annotated jailbreak benchmark—featuring expert-level, fine-grained annotations for assessing both safety judgments and judge reliability. Contribution/Results: Experiments show the framework achieves judgment consistency comparable to GPT-4o on HAJailBench while reducing inference cost by over 90%, substantially improving evaluation efficiency, accuracy, and scalability. This work provides the first empirical validation of lightweight multi-agent debate as an effective and robust approach to LLM safety assessment.
📝 Abstract
Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.