Clone-Robust AI Alignment

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward models in Reinforcement Learning from Human Feedback (RLHF) lack robustness when the candidate-answer distribution is imbalanced or contains semantically near-duplicate responses ("clones"), which biases preference estimation. Method: The paper gives the first formal definition of clone robustness for RLHF, together with theoretical guarantees, and proposes a weighted maximum likelihood estimation (weighted MLE) framework that combines ideas from social choice theory with semantic-similarity-based weighting, preserving statistical consistency while eliminating sensitivity to clones. Contribution/Results: The authors prove that the weighted MLE satisfies the clone-robustness axiom; evaluation shows improved reward-model stability and generalization under imbalanced data, in particular mitigating bias from redundant or semantically overlapping answers. The approach strengthens the theoretical foundations of preference learning and improves practical reliability in real-world RLHF deployments.

📝 Abstract
A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
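The weighting idea described in the abstract can be sketched under a Bradley-Terry preference model. The specific weight formula below (inverse of an alternative's total similarity mass, so a cluster of near-clones shares roughly one unit of weight) is an illustrative assumption, not the paper's exact construction:

```python
import math

def similarity_weights(sim):
    """Illustrative weights: each alternative gets the inverse of its total
    similarity mass, so a cluster of near-duplicates shares ~one unit of weight."""
    return [1.0 / sum(row) for row in sim]

def weighted_bt_loss(rewards, comparisons, weights, reg=0.01):
    """Regularized negative log-likelihood of pairwise wins under a
    Bradley-Terry model, with each comparison scaled by the weights of
    the two alternatives involved."""
    loss = reg * sum(r * r for r in rewards)
    for winner, loser in comparisons:
        w = weights[winner] * weights[loser]
        p_win = 1.0 / (1.0 + math.exp(rewards[loser] - rewards[winner]))
        loss -= w * math.log(p_win)
    return loss

# Alternatives 0 and 1 are near-clones (similarity 0.9); 2 is distinct.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
weights = similarity_weights(sim)  # clones 0 and 1 are down-weighted vs. 2
```

Under a scheme like this, adding more copies of an alternative inflates each copy's similarity row and shrinks its weight, so the cluster's total influence on the loss stays roughly constant, which is the intuition behind robustness to approximate clones.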
Problem

Research questions and friction points this paper is trying to address.

Imbalanced Data
Language Models
Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved RLHF Algorithm
Stability in Handling Repetitive Answers
Balanced Data Optimization
Authors

Ariel D. Procaccia (Paulson School of Engineering and Applied Sciences, Harvard University)
Benjamin Schiffer (Department of Statistics, Harvard University)
Shirley Zhang (University of Wisconsin, Madison)