Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current language-model post-training optimizes abstract scalar rewards, which reveal little about what preference data actually teaches a model. As a result, models can learn spurious correlations and pick up undesirable behaviors such as over-stylization or sycophancy. This work proposes a data-centric post-training framework that uses interpretability methods to make the latent concepts in preference data explicit. Before optimization, the approach identifies the features that separate preferred from dispreferred responses, frames them as statistical hypotheses, and supports fine-grained interventions at both the feature and data levels. The authors report that this helps diagnose undesirable learning signals in existing preference data, mitigates off-target learning, and can amplify or shape desired properties such as safeguards and model personality.
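To make the idea of testing a concept-level hypothesis concrete, here is a minimal sketch of one possible check. It is not the paper's procedure. It assumes that each preference pair has already been scored with a single interpretability feature (for example, a hypothetical "flattery" feature), giving one activation for the preferred response and one for the dispreferred response. The function name `paired_sign_flip_test` and its parameters are illustrative, and the sign-flip permutation test is a standard paired test chosen here for simplicity.

```python
import random
from statistics import mean


def paired_sign_flip_test(chosen, rejected, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired feature activations.

    chosen[i] and rejected[i] are the (assumed, precomputed) activations of one
    feature on the preferred and dispreferred response of pair i.
    Returns (mean difference, approximate p-value).
    """
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need equal-length, non-empty activation lists")

    diffs = [c - r for c, r in zip(chosen, rejected)]
    observed = mean(diffs)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels within a pair are exchangeable,
        # so each difference is equally likely to carry either sign.
        permuted = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= abs(observed):
            extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy activations, invented for illustration only.
    chosen = [0.1 * i + 0.5 for i in range(12)]
    rejected = [0.1 * i for i in range(12)]
    gap, p = paired_sign_flip_test(chosen, rejected)
    print(f"mean gap = {gap:.3f}, p = {p:.4f}")
```

A small p-value here would only indicate that the feature is systematically more active on preferred responses. Whether that signal is desirable is a separate judgment, and screening many features at once would also call for a multiple-comparison correction.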

📝 Abstract

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely consists of optimizing scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, so models can learn spurious correlations that induce undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? We introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses about the latent concepts separating preferred from dispreferred generations, making those concepts explicit so that users can give fine-grained feedback on them. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.
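The distinction between data-level and feature-level interventions can be illustrated with a short sketch. This is an assumption-laden toy, not the pipeline described in the abstract: the `PreferencePair` type, the `feature_gap` field (activation of one flagged feature on the preferred response minus the dispreferred one), and both function names are hypothetical, and the linear penalty is just one simple way a reward margin could be shaped.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    # Hypothetical, precomputed: flagged-feature activation on `chosen`
    # minus its activation on `rejected`.
    feature_gap: float


def drop_flagged_pairs(pairs, max_gap):
    """Data-level intervention: keep only pairs whose gap is at most max_gap."""
    return [p for p in pairs if p.feature_gap <= max_gap]


def shaped_margin(reward_margin, feature_gap, penalty):
    """Feature-level intervention: discount the reward margin in proportion to
    how much more the flagged feature fires on the preferred response."""
    return reward_margin - penalty * max(feature_gap, 0.0)


if __name__ == "__main__":
    # Toy pairs, invented for illustration only.
    pairs = [
        PreferencePair("q1", "Great question! ...", "Plain answer.", 0.8),
        PreferencePair("q2", "Concise answer.", "Rambling answer.", 0.1),
    ]
    kept = drop_flagged_pairs(pairs, max_gap=0.5)
    print([p.prompt for p in kept])
    print(shaped_margin(reward_margin=1.0, feature_gap=0.5, penalty=2.0))
```

Filtering removes the suspect examples entirely, while margin shaping keeps them but weakens the part of the signal attributed to the flagged feature. Which trade-off is preferable depends on how much useful signal the flagged pairs still carry.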

Problem

Research questions and friction points this paper is trying to address.

post-training

interpretability

preference dataset

learning signal

spurious correlations

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-training

interpretability

preference dataset