InfoPO: On Mutual Information Maximization for Large Language Model Alignment

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses preference-based post-training alignment of large language models (LLMs), noting that mainstream methods such as DPO rely on the Bradley-Terry (BT) modeling assumption, which makes them prone to overfitting and leads to suboptimal performance on reasoning-intensive tasks. To overcome this, the authors propose InfoPO, a preference optimization algorithm that explicitly maximizes the mutual information between prompts and preferred responses. InfoPO decouples response-probability modeling from preference modeling, removing the BT assumption, and adds a gradient-controllable likelihood constraint so that the probability of the preferred response does not decrease. InfoPO requires neither a separate reward model nor online sampling, enabling efficient end-to-end fine-tuning. Evaluation on benchmarks including AlpacaEval and MT-Bench shows that InfoPO consistently outperforms DPO, IPO, and other baselines, with an average +4.2% win-rate improvement reported on reasoning-heavy tasks.
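
The summary names two mechanisms: a preference term that avoids the Bradley-Terry sigmoid, and a constraint that keeps the chosen response's likelihood from dropping. The paper's actual InfoPO objective is not reproduced on this page, so the sketch below is only a hypothetical illustration in PyTorch: the hinge-style preference term, the reference-anchored likelihood floor, and the names `logp_chosen`, `logp_rejected`, `logp_chosen_ref`, `beta`, `gamma`, and `lam` are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def preference_loss_sketch(logp_chosen: torch.Tensor,
                           logp_rejected: torch.Tensor,
                           logp_chosen_ref: torch.Tensor,
                           beta: float = 0.1,
                           gamma: float = 1.0,
                           lam: float = 1.0) -> torch.Tensor:
    """Hypothetical loss combining (a) a hinge-style preference term that avoids
    the Bradley-Terry sigmoid and (b) a floor that penalizes the chosen
    response's log-likelihood only when it falls below its reference value.
    Illustrates the two mechanisms described in the summary; NOT the paper's
    actual InfoPO objective. Inputs are summed sequence log-probs, shape (batch,).
    """
    # (a) Preference term: require the chosen response to beat the rejected one
    #     by a margin gamma; no BT likelihood is fitted.
    margin = beta * (logp_chosen - logp_rejected)
    preference_term = F.relu(gamma - margin)

    # (b) Likelihood floor: discourage the chosen response's probability from
    #     decreasing relative to a frozen reference model.
    likelihood_floor = F.relu(logp_chosen_ref - logp_chosen)

    return (preference_term + lam * likelihood_floor).mean()
```

In practice the log-probabilities would be obtained by summing per-token log-probs of each response under the policy and a frozen reference model.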

📝 Abstract
We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward models and online sampling. Despite these benefits, these methods rely on explicit assumptions about the Bradley-Terry (BT) model, which makes them prone to overfitting and results in suboptimal performance, particularly on reasoning-heavy tasks. To address these challenges, we propose a principled preference fine-tuning algorithm called InfoPO, which effectively and efficiently aligns large language models using preference data. InfoPO eliminates the reliance on the BT model and prevents the likelihood of the chosen response from decreasing. Extensive experiments confirm that InfoPO consistently outperforms established baselines on widely used open benchmarks, particularly in reasoning tasks.
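
As background for the BT reliance noted in the abstract: DPO fits the standard Bradley-Terry preference likelihood through the policy's implicit reward, i.e., with policy $\pi_\theta$, frozen reference $\pi_{\text{ref}}$, chosen and rejected responses $y_w, y_l$, and temperature $\beta$:

$$
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)
$$

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\Big( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big]
$$

InfoPO is motivated by dropping this BT form rather than fitting it.
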
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM alignment with human preference data
Overcoming Bradley-Terry model limitations in preference fine-tuning
Improving performance in reasoning-heavy tasks via InfoPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximizes mutual information between prompts and preferred responses for LLM alignment (definition sketched after this list)
Eliminates reliance on Bradley-Terry model
Prevents chosen response likelihood decrease
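
For reference on the first bullet above, the mutual information between a prompt $X$ and a preferred response $Y$ (the exact estimator the paper optimizes is not shown on this page) is

$$
I(X; Y) = \mathbb{E}_{p(x, y)}\left[ \log \frac{p(y \mid x)}{p(y)} \right]
$$

so maximizing it pushes the model toward responses that are much more likely given the prompt than under the marginal, rather than fitting a BT preference likelihood.
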
Authors
Teng Xiao · Allen Institute for AI (AI2) & University of Washington · Machine Learning, Reinforcement Learning
Zhen Ge · Amazon
S. Sanghavi
Tian Wang · Amazon
Julian Katz-Samuels · University of Wisconsin · Machine Learning
Marc Versage · Amazon
Qingjun Cui · Amazon
Trishul M. Chilimbi · Amazon