Consistency Training while Mitigating Obfuscation via Rate Matching

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models are susceptible to extraneous input features, such as cues revealing a user's preferred answer. Existing consistency training methods constrain entire responses or internal activations, so they can teach a model to stop mentioning a cue while still being influenced by it, which compromises monitorability. This work proposes Rate Matching Consistency Training (RMCT), which applies the consistency constraint to the rate of a selected behaviour rather than to outputs or activations. By matching how often the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, RMCT improves robustness without constraining how the behaviour is expressed. Because it does not require paired inputs with and without the extraneous feature, the method also applies where that feature cannot be removed. In sycophancy-reduction experiments on two open-weight models, RMCT reduces bias-following on held-out bias types about as much as a standard consistency-training baseline while largely preserving the model's tendency to verbalise the cue; it was more data-efficient but less compute-efficient.

📝 Abstract

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.
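The abstract does not give the RMCT objective, so the following is only a minimal sketch of what "matching the rate of a target behaviour across input perturbations" could look like as a measurement. It assumes each sampled response has already been labelled with a boolean for the target behaviour (for example, whether it followed the bias cue). The function names, the perturbation labels, and the squared-deviation penalty are all illustrative assumptions, not the authors' method.

```python
from statistics import mean


def behaviour_rate(flags):
    """Fraction of sampled responses that exhibit the target behaviour."""
    if not flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for f in flags if f) / len(flags)


def rate_mismatch(flags_by_perturbation):
    """Mean squared deviation of each perturbation's rate from the mean rate.

    Hypothetical penalty: it is zero when the behaviour occurs equally often
    under every perturbation, and it says nothing about what the responses
    contain beyond the behaviour label.
    """
    rates = [behaviour_rate(flags) for flags in flags_by_perturbation.values()]
    target = mean(rates)
    return mean((rate - target) ** 2 for rate in rates)


# Hypothetical labels: True means the response followed the bias cue.
samples = {
    "cue_suggests_A": [True, True, True, False],
    "cue_suggests_B": [True, False, False, False],
}
print(rate_mismatch(samples))
```

Because the penalty depends only on per-perturbation rates, two responses that both follow (or both ignore) a cue contribute identically whether or not they mention it, which is the property the abstract contrasts with consistency over entire responses. A trainable version would need a differentiable estimate of these rates, which this sketch does not attempt.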

Problem

Research questions and friction points this paper is trying to address.

consistency training

obfuscation

extraneous features

behavioral robustness

monitorability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rate Matching

Consistency Training

Obfuscation Mitigation