Mechanistic Anomaly Detection for "Quirky" Language Models

📅 2025-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) face supervision-failure risk: sensitivity to latent factors unknown to supervisors, which can corrupt training signals. Method: This paper proposes Mechanistic Anomaly Detection (MAD), a framework integrating mechanistic interpretability with unsupervised anomaly detection. MAD operates at the supervision-signal level, building a multi-feature, multi-score anomaly-identification system on internal LLM representations (neuron activations, attention patterns, and intermediate-layer gradient responses) rather than on output-level analysis. It jointly applies Local Outlier Factor (LOF), Isolation Forest, and Mahalanobis distance for robust modeling. Results: MAD achieves AUC > 0.9 on several "quirky" tasks, demonstrating strong discriminative power in low-stakes supervision settings. While cross-model generalization remains limited, MAD establishes an interpretable diagnostic paradigm for high-stakes supervision.
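The three unsupervised scorers named above can be combined over internal-feature vectors roughly as follows. This is a minimal sketch using scikit-learn, with synthetic arrays standing in for actual model activations; the rank-averaging combination and the flagging threshold are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)
# Hypothetical stand-ins for internal model features (e.g. hidden-state
# activations): trusted training points vs. unlabeled test points.
train_feats = rng.normal(0, 1, size=(500, 32))
test_feats = np.vstack([
    rng.normal(0, 1, size=(50, 32)),   # in-distribution test points
    rng.normal(4, 1, size=(50, 32)),   # anomalous test points
])

# 1) Local Outlier Factor in novelty mode (fit on trusted data only).
lof = LocalOutlierFactor(novelty=True).fit(train_feats)
lof_scores = -lof.score_samples(test_feats)       # higher = more anomalous

# 2) Isolation Forest.
iforest = IsolationForest(random_state=0).fit(train_feats)
if_scores = -iforest.score_samples(test_feats)    # higher = more anomalous

# 3) Squared Mahalanobis distance under a Gaussian fit to trusted features.
cov = EmpiricalCovariance().fit(train_feats)
maha_scores = cov.mahalanobis(test_feats)

def ranks(s):
    """Map raw scores to [0, 1) ranks so the three scales are comparable."""
    return np.argsort(np.argsort(s)) / len(s)

# Rank-average the three scorers, then flag the highest-scoring half of
# the test batch for investigation (threshold choice is an assumption).
combined = (ranks(lof_scores) + ranks(if_scores) + ranks(maha_scores)) / 3
flagged = combined > np.quantile(combined, 0.5)
print(f"anomalous points flagged: {flagged[50:].mean():.0%}")
```

On this synthetic data the anomalies are far from the trusted distribution, so all three scorers agree; on real model internals the paper finds the detectors disagree across tasks, which is why multiple features and scoring rules are evaluated.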

📝 Abstract
As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high-stakes settings.
Problem

Research questions and friction points this paper is trying to address.

Detect anomalies in quirky language models using internal features
Flag test points differing from training data for investigation
Assess effectiveness of anomaly detectors across models and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

MAD uses internal features for anomaly detection
Detectors flag test-training environment discrepancies
Varied detector features and scoring rules tested
David O. Johnston
EleutherAI, Washington, DC 20010, USA
Arkajyoti Chakraborty
University of California Santa Cruz, California, 95054, USA
Nora Belrose
Research Lead, EleutherAI
interpretability · neural networks · transformers · nlp · ai