Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models (LLMs) to alignment failures under out-of-distribution (OOD) conditions, a risk that existing monitoring pipelines capture poorly. To evaluate OOD alignment failure detection systematically, the authors introduce MOOD (Misalignment Out Of Distribution), a benchmark comprising a restricted training set and seven diverse OOD test sets. They propose a hybrid monitor that combines a guard model (safety classifier) with OOD detectors based on Mahalanobis distance and perplexity. This combination improves recall from 39% to 45%, a gain of 6 percentage points. Monitors that pair a guard model with an OOD detector also show positive scaling trends across model scales, and adding OOD detection yields a larger recall gain than using a guard model with 20 times more parameters.

📝 Abstract

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem.
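
The abstract describes combining a guard model with Mahalanobis distance and perplexity-based OOD detectors but does not say how the signals are merged. The following is a minimal illustrative sketch, not the authors' implementation. It assumes a simple OR rule over three thresholded signals, a diagonal covariance for the Mahalanobis distance (a simplification that keeps the example dependency-free), and hypothetical names and threshold values throughout (`fit_diagonal_gaussian`, `monitor_flags`, `guard_threshold`, and so on).

```python
import math
from statistics import fmean, pvariance


def fit_diagonal_gaussian(features):
    """Fit per-dimension mean and variance on in-distribution feature vectors.

    Assumption: a diagonal covariance, which is a simplification of the
    full-covariance Mahalanobis distance.
    """
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    # A small floor keeps the division well-defined for constant dimensions.
    variances = [max(pvariance(col), 1e-6) for col in columns]
    return means, variances


def mahalanobis_distance(x, means, variances):
    """Mahalanobis distance of x from the fitted diagonal Gaussian."""
    return math.sqrt(
        sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))
    )


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-fmean(token_logprobs))


def monitor_flags(
    guard_score,
    features,
    token_logprobs,
    means,
    variances,
    guard_threshold=0.5,        # hypothetical value
    distance_threshold=3.0,     # hypothetical value
    perplexity_threshold=50.0,  # hypothetical value
):
    """Flag an input if the guard model OR either OOD detector fires.

    The OR rule is an assumption made for illustration; the source does not
    specify how the guard and OOD scores are combined.
    """
    return (
        guard_score >= guard_threshold
        or mahalanobis_distance(features, means, variances) > distance_threshold
        or perplexity(token_logprobs) > perplexity_threshold
    )


# Toy usage: two in-distribution feature vectors, then a far-away input that
# the guard model scores as safe but the distance detector flags.
means, variances = fit_diagonal_gaussian([[0.0, 1.0], [2.0, 3.0]])
flagged = monitor_flags(0.1, [4.0, 6.0], [-1.0, -1.0], means, variances)
```

Under an OR rule like this, the OOD detectors can only add flags on top of the guard model, so recall cannot decrease. The thresholds would still need tuning on held-out data to keep false positives in check.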

Problem

Research questions and friction points this paper is trying to address.

out-of-distribution

alignment failure

large language models

monitoring

safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

out-of-distribution detection

LLM alignment

guard models