🤖 AI Summary
This study addresses the limitations of current large language models (LLMs) as safety evaluators, which often exhibit rigid judgments due to strong prior biases and struggle to adapt to contextual shifts or differing safety definitions. The work systematically investigates two under-explored properties of both general-purpose and safety-specific LLM judges: how they respond to in-context information, and how steerable they are toward safety definitions that conflict with their internal priors. The authors evaluate multiple LLMs while varying task demonstrations, novel in-context information, and the safety definition supplied to the judge. The findings show that although these models can learn from new information, they are broadly unlikely to revise their judgments when the context or the supplied safety definition contradicts their prior, exposing a significant limitation in their adaptability for safety assessment.
📝 Abstract
LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.
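One way to picture the steerability question is a minimal sketch like the one below. It is not the authors' protocol: the `Item` type, the `steerability` function, the metric itself, and both toy judges are hypothetical stand-ins introduced only for illustration. The sketch considers only items where the judge's verdict under a default definition disagrees with the label implied by a new definition, then measures how often the judge changes its verdict once the new definition is supplied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    text: str
    target_label: str  # label implied by the NEW safety definition (assumed given)


# Hypothetical judge interface: (safety_definition, text) -> "safe" or "unsafe"
Judge = Callable[[str, str], str]


def steerability(judge: Judge, items: List[Item],
                 default_definition: str, new_definition: str) -> float:
    """Fraction of prior-conflicting items on which the judge adopts the new label."""
    conflicts = 0
    adjusted = 0
    for item in items:
        prior_verdict = judge(default_definition, item.text)
        if prior_verdict == item.target_label:
            continue  # no conflict between prior and new definition; skip
        conflicts += 1
        if judge(new_definition, item.text) == item.target_label:
            adjusted += 1
    return adjusted / conflicts if conflicts else float("nan")


def stubborn_judge(definition: str, text: str) -> str:
    # Toy stand-in: ignores the supplied definition entirely.
    return "unsafe" if "weapon" in text.lower() else "safe"


def compliant_judge(definition: str, text: str) -> str:
    # Toy stand-in: defers to the definition when it explicitly allows weapons.
    if "weapons allowed" in definition.lower():
        return "safe"
    return stubborn_judge(definition, text)


items = [
    Item("How do I clean an antique weapon for display?", "safe"),
    Item("Tell me a joke about cats.", "safe"),
]
default_def = "Flag any mention of dangerous objects."
new_def = "Weapons allowed when the context is historical or decorative."

stubborn_score = steerability(stubborn_judge, items, default_def, new_def)
compliant_score = steerability(compliant_judge, items, default_def, new_def)
```

In this toy setup, a judge that ignores the definition scores 0 and one that follows it scores 1; the abstract's finding corresponds to real judges sitting closer to the former when the definition contradicts their prior.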