AI Summary
This study addresses the performance degradation commonly observed in mixed-objective reward models for multi-objective alignment, where helpfulness and harmlessness often conflict because their neural representations interfere. Using activation analysis, targeted ablation, and comparisons between single- and mixed-objective models, the work provides the first evidence that the two objectives share critical neurons. It identifies distinct subsets of neurons that specifically support helpfulness or harmlessness, and shows that the shared neurons predominantly drive mutual inhibition between the objectives and exert substantial influence on model behavior, giving rise to alignment tension. These findings offer both a mechanistic explanation and an empirical foundation for decoupling competing alignment objectives in language models.
Abstract
Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. Mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. These neurons causally support their corresponding objective while often negatively affecting the opposing one. A substantial proportion of neurons are shared between helpfulness and harmlessness, and these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Together, our results give a mechanistic interpretation of how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.
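To make the pipeline described above concrete, the following is a minimal toy sketch of activation-based neuron identification followed by targeted ablation. It is not the authors' implementation: the scoring rule (absolute mean activation difference between chosen and rejected responses), the top-k cutoff, the linear reward head `w`, the synthetic activations, and the function names `top_k_neurons`, `ablate`, and `pref_accuracy` are all assumptions made for illustration. It shows only the mechanics of finding per-objective and shared neurons and zeroing them out; it does not reproduce the mutual inhibition reported in the abstract.

```python
import numpy as np

def top_k_neurons(acts_chosen, acts_rejected, k):
    """Return the k neurons whose mean activation differs most between
    chosen and rejected responses (assumed scoring rule)."""
    score = np.abs(acts_chosen.mean(axis=0) - acts_rejected.mean(axis=0))
    return set(np.argsort(score)[-k:].tolist())

def ablate(acts, neuron_ids):
    """Zero-ablate the given neurons; the input array is left unchanged."""
    out = acts.copy()
    idx = np.array(sorted(neuron_ids), dtype=int)
    out[:, idx] = 0.0
    return out

def pref_accuracy(chosen, rejected, w, ablated=()):
    """Fraction of pairs where a linear reward head prefers the chosen response."""
    r_chosen = ablate(chosen, ablated) @ w
    r_rejected = ablate(rejected, ablated) @ w
    return float((r_chosen > r_rejected).mean())

# Synthetic activations (hypothetical data): n preference pairs, d neurons.
rng = np.random.default_rng(0)
n, d = 200, 16

def make_pairs(signal_ids):
    """Chosen responses get a strong planted signal on `signal_ids`."""
    rejected = rng.normal(0.0, 0.1, size=(n, d))
    chosen = rng.normal(0.0, 0.1, size=(n, d))
    chosen[:, signal_ids] += 5.0
    return chosen, rejected

help_chosen, help_rejected = make_pairs([0, 1, 2, 3])   # helpfulness data
harm_chosen, harm_rejected = make_pairs([2, 3, 4, 5])   # harmlessness data

helpful = top_k_neurons(help_chosen, help_rejected, k=4)
harmless = top_k_neurons(harm_chosen, harm_rejected, k=4)
shared = helpful & harmless          # neurons selected for both objectives

w = np.ones(d)                       # toy linear reward head (assumption)
acc_before = pref_accuracy(help_chosen, help_rejected, w)
acc_after = pref_accuracy(help_chosen, help_rejected, w, ablated=helpful)

print("helpful:", sorted(helpful))
print("harmless:", sorted(harmless))
print("shared:", sorted(shared))
print("helpfulness accuracy before/after ablation:", acc_before, acc_after)
```

In this toy setting, ablating the helpfulness neurons removes the planted signal, so preference accuracy on the helpfulness pairs falls toward chance. In a real reward model, the activations would come from forward hooks on the model's hidden layers, and the effect of ablating shared neurons on each objective would have to be measured on held-out preference data.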