🤖 AI Summary
This work addresses the tension between personalization and fairness in large language models: adapting to individual user preferences can compromise consistency and equity across social groups on objective factual tasks. To mitigate this issue, the authors study Truth-Invariant Alignment (TIA), an alignment problem that aims to keep universal facts consistent across social groups while preserving personalization. They introduce TriAlign, the first offline multi-agent reinforcement learning framework designed for TIA, which models distinct social groups as interacting agents and combines a fairness-aware optimization objective with an explicit inconsistency penalty. Experimental results show that TriAlign reduces inter-group disparities in factual responses while improving both performance on objective tasks and personalization quality, achieving a better balance among these goals than strong baselines.
📝 Abstract
Personalized large language models adapt responses to users' preferences and social attributes, but they can introduce substantial inconsistencies in universal truths across social groups, with some groups systematically receiving less accurate responses on objective tasks. Existing alignment methods either ignore personalization or focus mainly on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to keep universal truths consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, in which each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign balances these three objectives better than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.
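To make the three quantities in the abstract concrete, the sketch below shows one simple way to measure per-group accuracy, a cross-group disparity, and a combined score that penalizes that disparity. It is an illustration only and not the TriAlign objective: the max-min gap as the inconsistency measure, the function names, and the weights `lam` and `beta` are all assumptions that do not come from the paper.

```python
from collections import defaultdict


def group_accuracies(records):
    """Map each group to its accuracy, given (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {g: correct[g] / totals[g] for g in totals}


def accuracy_gap(acc_by_group):
    """Max-min accuracy gap across groups (assumed disparity measure).

    A value of 0.0 means every group receives equally accurate answers.
    """
    values = list(acc_by_group.values())
    return max(values) - min(values)


def combined_score(acc_by_group, personalization, lam=1.0, beta=0.5):
    """Hypothetical scalar: mean accuracy + beta * personalization - lam * gap.

    `lam` and `beta` are illustrative weights, not values from the paper.
    """
    mean_acc = sum(acc_by_group.values()) / len(acc_by_group)
    return mean_acc + beta * personalization - lam * accuracy_gap(acc_by_group)


# Toy data: the same objective questions answered for two hypothetical groups.
records = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", True),
]
acc = group_accuracies(records)
gap = accuracy_gap(acc)
score = combined_score(acc, personalization=0.8)
```

In this toy data, group A is answered correctly more often than group B, so the gap is nonzero and lowers the combined score. A model that closed the gap without losing accuracy or personalization would score higher under this assumed measure.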