LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how inconsistent large language models (LLMs) are when used as automated judges in multi-dimensional, reference-free safety evaluation. The authors assess judgment consistency across safety criteria, harm categories, domains, languages, and linguistic styles. They find that LLM judges are more reliable at flagging overt harms such as violence, but unreliable at identifying safety issues in machine-generated advice in regulated domains such as finance. Consistency varies significantly with the chosen safety criterion, and different judges disagree strongly on the same output. The paper closes with recommendations for practitioners who use automated judges in practice.

📝 Abstract

We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues related to machine-generated advice in regulated domains such as finance, although they are more reliable at identifying more overt forms of unsafe/harmful content such as violence. The degree of inconsistency in a model's judgments can vary significantly by the chosen safety criteria and can be impacted by the language of the content and its linguistic style as well. Finally, there is high disagreement among different judges for the same output, across domains, safety criteria, and languages. These findings provide new insights on the practice of using LLMs as evaluators and offer several recommendations for practitioners on how to use automated judges in practical scenarios.
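
The abstract distinguishes two kinds of unreliability: a single judge giving different verdicts on the same output, and different judges disagreeing with each other. The sketch below shows one common way to quantify each. It is an illustration only: the abstract does not say which metrics the authors use, so the majority-agreement rate, Cohen's kappa, the function names, and the "safe"/"unsafe" labels are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def self_consistency(labels):
    """Share of one judge's repeated verdicts that match its most common verdict.

    Assumed metric: 1.0 means the judge always gave the same verdict.
    """
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def cohen_kappa(a, b):
    """Cohen's kappa for two judges labelling the same items in the same order."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected by chance, from each judge's label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(verdicts):
    """Kappa for every pair of judges; `verdicts` maps judge name -> label list."""
    return {
        (j1, j2): cohen_kappa(verdicts[j1], verdicts[j2])
        for j1, j2 in combinations(sorted(verdicts), 2)
    }


# Hypothetical verdicts, invented for illustration (not data from the paper).
repeated_runs = ["unsafe", "unsafe", "safe"]
judges = {
    "judge_a": ["safe", "unsafe", "safe", "unsafe"],
    "judge_b": ["safe", "unsafe", "unsafe", "unsafe"],
}
print(self_consistency(repeated_runs))
print(pairwise_kappa(judges))
```

Computing these scores separately per safety criterion, harm category, and language would expose the kind of variation the abstract describes, since a single pooled score can hide a judge that is stable on violence but unstable on financial advice.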

Problem

Research questions and friction points this paper is trying to address.

LLM judges

safety evaluation

inconsistency

harm categories

automated evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM judges

safety evaluation

consistency