Consistency Training Along the Transformer Stack

📅 2026-06-04

🤖 AI Summary

This work addresses alignment failures of large language models under four safety threats beyond the sycophancy and jailbreak settings studied in prior work: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. It introduces two new internal consistency targets within the Transformer stack: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. These are studied alongside ACT and BCT, the latter of which operates on the model's overall behavioral output. Across several models and threat settings, consistency training reduces misalignment, and in some cases it generalizes across threats, so that training against one failure mode improves robustness to another. The study also identifies a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, and finds BCT to be mechanistically distinct.

📝 Abstract

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.
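
To make the two internal targets described in the abstract concrete, here is a minimal sketch of what the matching terms might look like. It is not the paper's implementation. The abstract does not state the loss functions, so the use of mean squared error for post-activation MLP states and KL divergence for per-head attention distributions is an assumption, as are the function names `mlp_consistency_loss` and `attention_consistency_loss`. The sketch also assumes that activations from the clean prompt and the perturbed (for example, attack-wrapped) prompt have already been aligned to the same token positions.

```python
import math


def mlp_consistency_loss(clean_acts, perturbed_acts):
    """Mean squared error between post-activation MLP states.

    Both arguments are lists of equal-length activation vectors, one vector
    per aligned token position. MSE is an assumed choice of distance.
    """
    total, count = 0.0, 0
    for clean_vec, pert_vec in zip(clean_acts, perturbed_acts):
        for c, p in zip(clean_vec, pert_vec):
            total += (c - p) ** 2
            count += 1
    return total / count


def attention_consistency_loss(clean_attn, perturbed_attn, eps=1e-12):
    """Mean KL(clean || perturbed) over attention heads.

    Each argument is a list with one entry per head; each entry is a
    probability distribution over key positions. KL is an assumed choice.
    """
    kl_sum = 0.0
    for clean_head, pert_head in zip(clean_attn, perturbed_attn):
        kl_sum += sum(
            c * math.log((c + eps) / (p + eps))
            for c, p in zip(clean_head, pert_head)
            if c > 0  # terms with zero clean probability contribute nothing
        )
    return kl_sum / len(clean_attn)


# Toy usage with hypothetical values: one token position, two MLP units,
# and a single attention head over two key positions.
mlp_loss = mlp_consistency_loss([[1.0, 2.0]], [[1.0, 4.0]])
attn_loss = attention_consistency_loss([[0.5, 0.5]], [[0.9, 0.1]])
```

In a training setup, the clean-prompt activations would typically be treated as fixed targets, so that only the perturbed-prompt forward pass is pulled toward them. Whether the paper does this is not stated in the abstract.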

Problem

Research questions and friction points this paper is trying to address.

consistency training

model misalignment

safety threats

transformer stack

alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency Training

MLP Consistency Training

Attention Consistency Training