BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment

📅 2024-11-16

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

career value

167K/year

🤖 AI Summary

In RLHF, large language models suffer from imbalanced optimization between knowledge breadth and depth, compounded by skewed prompt-response sample distributions. Method: We propose a dynamic balancing framework that (i) formally defines knowledge breadth and depth; (ii) introduces a gradient-guided dynamic depth enhancement mechanism, which evaluates sample knowledge informativeness along the optimization direction and enables differentiated depth modeling via clustering; and (iii) integrates gradient analysis, preference optimization, and knowledge-aware data augmentation. Results: Our approach achieves significant improvements in response quality across multiple alignment benchmarks (average +4.2% win rate), with training overhead increase <8%, preserving computational efficiency. It establishes a reusable, knowledge-aware data optimization paradigm and provides practical implementation guidelines for preference learning.

Technology Category

Application Category

📝 Abstract

Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large language models (LLMs) in recent years. In this work, we first introduce the concepts of knowledge breadth and knowledge depth, which measure the comprehensiveness and depth of an LLM or knowledge source respectively. We reveal that the imbalance in the number of prompts and responses can lead to a potential disparity in breadth and depth learning within alignment tuning datasets by showing that even a simple uniform method for balancing the number of instructions and responses can lead to significant improvements. Building on this, we further propose Balanced Preference Optimization (BPO), designed to dynamically augment the knowledge depth of each sample. BPO is motivated by the observation that the usefulness of knowledge varies across samples, necessitating tailored learning of knowledge depth. To achieve this, we introduce gradient-based clustering, estimating the knowledge informativeness and usefulness of each augmented sample based on the model's optimization direction. Our experimental results across various benchmarks demonstrate that BPO outperforms other baseline methods in alignment tuning while maintaining training efficiency. Furthermore, we conduct a detailed analysis of each component of BPO, providing guidelines for future research in preference data optimization.

Problem

Research questions and friction points this paper is trying to address.

Address imbalance in knowledge breadth and depth

Optimize alignment tuning in reinforcement learning

Enhance LLM performance through balanced preference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced Preference Optimization (BPO)

gradient-based clustering

dynamic knowledge depth augmentation

🔎 Similar Papers

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization