InfoPO: On Mutual Information Maximization for Large Language Model Alignment

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses preference-based post-training alignment of large language models (LLMs), noting that mainstream methods such as DPO rely on the Bradley-Terry (BT) modeling assumption, which makes them prone to overfitting and leads to suboptimal performance on reasoning-intensive tasks. To overcome this, the authors propose InfoPO, a preference optimization algorithm that explicitly maximizes the mutual information between prompts and preferred responses. InfoPO decouples response-probability modeling from preference modeling, removing the BT assumption, and adds a gradient-controllable likelihood constraint so that the probability of the preferred response does not decrease. InfoPO requires neither a separate reward model nor online sampling, enabling efficient end-to-end fine-tuning. Evaluation on benchmarks including AlpacaEval and MT-Bench shows that InfoPO consistently outperforms DPO, IPO, and other baselines, with an average +4.2% win-rate improvement reported on reasoning-heavy tasks.
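
The summary names two mechanisms: a preference term that avoids the Bradley-Terry sigmoid, and a constraint that keeps the chosen response's likelihood from dropping. The paper's actual InfoPO objective is not reproduced on this page, so the sketch below is only a hypothetical illustration in PyTorch: the hinge-style preference term, the reference-anchored likelihood floor, and the names `logp_chosen`, `logp_rejected`, `logp_chosen_ref`, `beta`, `gamma`, and `lam` are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def preference_loss_sketch(logp_chosen: torch.Tensor,
                           logp_rejected: torch.Tensor,
                           logp_chosen_ref: torch.Tensor,
                           beta: float = 0.1,
                           gamma: float = 1.0,
                           lam: float = 1.0) -> torch.Tensor:
    """Hypothetical loss combining (a) a hinge-style preference term that avoids
    the Bradley-Terry sigmoid and (b) a floor that penalizes the chosen
    response's log-likelihood only when it falls below its reference value.
    Illustrates the two mechanisms described in the summary; NOT the paper's
    actual InfoPO objective. Inputs are summed sequence log-probs, shape (batch,).
    """
    # (a) Preference term: require the chosen response to beat the rejected one
    #     by a margin gamma; no BT likelihood is fitted.
    margin = beta * (logp_chosen - logp_rejected)
    preference_term = F.relu(gamma - margin)

    # (b) Likelihood floor: discourage the chosen response's probability from
    #     decreasing relative to a frozen reference model.
    likelihood_floor = F.relu(logp_chosen_ref - logp_chosen)

    return (preference_term + lam * likelihood_floor).mean()
```

In practice the log-probabilities would be obtained by summing per-token log-probs of each response under the policy and a frozen reference model.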

📝 Abstract
We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward models and online sampling. Despite these benefits, these methods rely on explicit assumptions about the Bradley-Terry (BT) model, which makes them prone to overfitting and results in suboptimal performance, particularly on reasoning-heavy tasks. To address these challenges, we propose a principled preference fine-tuning algorithm called InfoPO, which effectively and efficiently aligns large language models using preference data. InfoPO eliminates the reliance on the BT model and prevents the likelihood of the chosen response from decreasing. Extensive experiments confirm that InfoPO consistently outperforms established baselines on widely used open benchmarks, particularly in reasoning tasks.
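
As background for the BT reliance noted in the abstract: DPO fits the standard Bradley-Terry preference likelihood through the policy's implicit reward, i.e., with policy $\pi_\theta$, frozen reference $\pi_{\text{ref}}$, chosen and rejected responses $y_w, y_l$, and temperature $\beta$:

$$
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)
$$

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\Big( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big]
$$

InfoPO is motivated by dropping this BT form rather than fitting it.
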
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM alignment with human preference data
Overcoming Bradley-Terry model limitations in preference fine-tuning
Improving performance in reasoning-heavy tasks via InfoPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximizes mutual information between prompts and preferred responses for LLM alignment (definition sketched after this list)
Eliminates reliance on Bradley-Terry model
Prevents chosen response likelihood decrease
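
For reference on the first bullet above, the mutual information between a prompt $X$ and a preferred response $Y$ (the exact estimator the paper optimizes is not shown on this page) is

$$
I(X; Y) = \mathbb{E}_{p(x, y)}\left[ \log \frac{p(y \mid x)}{p(y)} \right]
$$

so maximizing it pushes the model toward responses that are much more likely given the prompt than under the marginal, rather than fitting a BT preference likelihood.
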
Authors
Teng Xiao · Allen Institute for AI (AI2) & University of Washington · Machine Learning, Reinforcement Learning
Zhen Ge · Amazon
S. Sanghavi
Tian Wang · Amazon
Julian Katz-Samuels · University of Wisconsin · Machine Learning
Marc Versage · Amazon
Qingjun Cui · Amazon
Trishul M. Chilimbi · Amazon