AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitors

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of systematic benchmarks for evaluating LLM-based misbehavior monitors, which hinders reliable assessment of their detection capabilities and false alarm rates across diverse tasks and failure modes. We propose AutoMonitor-Bench, the first comprehensive benchmark spanning question answering, code generation, and reasoning tasks, featuring 3,010 annotated pairs of misbehaving and benign samples. We introduce a dual-metric evaluation framework based on Miss Rate (MR) and False Alarm Rate (FAR). Fine-tuning Qwen3-4B-Instruction on 153,581 training samples and evaluating across 22 mainstream LLMs reveals a pervasive MR–FAR trade-off. While fine-tuning improves recognition of known violations, it struggles to generalize to implicit, unseen violations, highlighting the inherent tension between safety and utility in building reliable LLM monitors.

πŸ“ Abstract
We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
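The abstract defines the dual-metric framework only in words. As a hedged illustration, the two metrics can be computed as follows under their standard definitions (the function name and toy data below are our own, not from the paper):

```python
def mr_far(labels, flags):
    """Compute Miss Rate and False Alarm Rate for a misbehavior monitor.

    labels: ground truth per sample, 1 = misbehavior, 0 = benign
    flags:  monitor verdict per sample, 1 = flagged as misbehavior
    """
    misbehavior = [f for y, f in zip(labels, flags) if y == 1]
    benign = [f for y, f in zip(labels, flags) if y == 0]
    # MR: fraction of true misbehavior samples the monitor fails to flag.
    mr = sum(1 for f in misbehavior if f == 0) / len(misbehavior)
    # FAR: fraction of benign samples the monitor wrongly flags.
    far = sum(1 for f in benign if f == 1) / len(benign)
    return mr, far

# Toy example: 4 misbehavior samples (1 missed), 4 benign samples (2 false alarms).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
flags  = [1, 1, 1, 0, 1, 1, 0, 0]
print(mr_far(labels, flags))  # → (0.25, 0.5)
```

The trade-off the paper reports follows directly from these definitions: making a monitor stricter converts misses into flags (lowering MR) but also flags more benign behavior (raising FAR).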
Problem

Research questions and friction points this paper is trying to address.

misbehavior monitoring
LLM reliability
false alarm rate
miss rate
safety-utility trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based monitoring
misbehavior detection
benchmark
false alarm rate
reliability evaluation