That's not natural: The Impact of Off-Policy Training Data on Probe Performance

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how off-policy training data affects the generalization of behavioral probes for large language models (LLMs), a pressing question because natural examples of target behaviors such as deception or sandbagging are scarce, forcing researchers to train linear and attention-based probes on synthetic or non-target-policy responses. Evaluating probes across eight behavioral categories and multiple LLMs, the authors find that both the generation policy and the domain of the training data shape generalization, with same-domain off-policy data consistently outperforming data from a different domain; domain shift emerges as the larger driver of performance degradation. They further show that generalization from off-policy data to test sets where the model is incentivized to produce the target behavior predicts on-policy generalization, and use this result to forecast that probes for covert behaviors, notably deception and sandbagging, may fail when deployed in real monitoring scenarios.

Technology Category
AI, Machine Learning, Deep Learning
📝 Abstract
Probing has emerged as a promising method for monitoring Large Language Models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect varies by behaviour. We find that successful generalisation from off-policy data, to test sets where the model is incentivised to produce the target behaviour, is predictive of successful on-policy generalisation. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. Notably, shifts in the training data domain still cause even larger performance degradation, with different-domain test scores being consistently lower than the same-domain ones. These results indicate that, in the absence of on-policy data, using same-domain off-policy data yields more reliable probes than using on-policy data from a different domain, emphasizing the need for methods that can better handle distribution shifts in LLM monitoring.
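
For readers unfamiliar with activation probing, here is a minimal sketch of the recipe the abstract describes: a linear probe fit on model activations from off-policy responses and scored on on-policy data. The random activation arrays, mean-pooled shapes, and logistic-regression probe are illustrative assumptions, not the paper's exact setup (which also evaluates attention probes on real LLM activations).

```python
# Minimal sketch of a linear behavioural probe, assuming activations have
# already been extracted from a transformer layer. Shapes and the choice of
# logistic regression are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical mean-pooled residual-stream activations (n_examples, d_model)
# for off-policy training responses and on-policy test responses.
X_train_off_policy = rng.normal(size=(512, 1024))
y_train = rng.integers(0, 2, size=512)           # 1 = behaviour present
X_test_on_policy = rng.normal(size=(128, 1024))
y_test = rng.integers(0, 2, size=128)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train_off_policy, y_train)

# AUROC on on-policy data measures off-policy -> on-policy generalisation;
# with random data it will hover near 0.5, which is expected here.
auroc = roc_auc_score(y_test, probe.predict_proba(X_test_on_policy)[:, 1])
print(f"on-policy AUROC: {auroc:.3f}")
```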
Problem

Research questions and friction points this paper is trying to address.

Evaluating how synthetic and off-policy training data affect probe generalization in LLMs
Assessing how probe performance shifts between off-policy training data and on-policy test data
Investigating the impact of domain shifts on the reliability of LLM behavior monitoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains behavioral probes on off-policy data when natural on-policy examples are scarce
Shows that same-domain off-policy data yields more reliable probes than on-policy data from a different domain
Demonstrates that generalization from off-policy data to incentivized test sets predicts on-policy success (see the sketch after this list)
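
A hedged sketch of the train/test grid these findings imply: probes trained under different (policy, domain) conditions are all scored on the same on-policy, same-domain test set. The `load_activations` loader, condition names, and random data are hypothetical stand-ins for the paper's real activation extraction across eight behaviours.

```python
# Hypothetical sketch of the (policy, domain) comparison suggested by the
# paper's findings; load_activations stands in for real activation
# extraction, and the random data only illustrates the evaluation shape.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_activations(policy: str, domain: str, seed: int, n: int = 256, d: int = 512):
    """Stand-in loader: returns (features, labels) for one data condition."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n, d)), rng.integers(0, 2, size=n)

# Fixed evaluation target: on-policy responses in the monitored domain.
X_test, y_test = load_activations("on-policy", "same-domain", seed=0)

# Training conditions the paper compares (seeds are arbitrary).
conditions = [
    ("off-policy", "same-domain", 1),
    ("on-policy", "other-domain", 2),
    ("off-policy", "other-domain", 3),
]

for policy, domain, seed in conditions:
    X_tr, y_tr = load_activations(policy, domain, seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
    # The paper reports same-domain off-policy training as the most
    # reliable fallback when on-policy data is unavailable.
    print(f"train = {policy:10s} / {domain:12s} -> test AUROC {auroc:.3f}")
```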
Nathalie Kirch, LASR Labs
Samuel Dower, LASR Labs
Adrians Skapars, University of Manchester
Ekdeep Singh Lubana, Goodfire AI
Dmitrii Krasheninnikov, University of Cambridge