🤖 AI Summary
This work addresses a key limitation of existing reinforcement learning from human feedback (RLHF) approaches, which typically assume a single universal reward and therefore struggle to capture the diversity of user preferences. It further identifies posterior collapse in variational preference learning (VPL) under sparse preference data, which renders the user-specific latent variables ineffective. To resolve this issue, the paper proposes Swap-guided Preference Learning (SPL), the first method explicitly designed to mitigate posterior collapse in preference modeling. SPL constructs fictitious "swapped annotators" and exploits the mirroring property of their preferences, combining swap-guided regularization, a Preferential Inverse Autoregressive Flow (P-IAF), and an adaptive latent conditioning mechanism to encourage the encoder to make use of the latent variables. Experimental results demonstrate that SPL significantly alleviates posterior collapse, enriches the latent representations, and improves the accuracy of personalized preference prediction.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we find that VPL suffers from posterior collapse: while this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, the latent variables in VPL can be ignored entirely, reverting the model to a single-reward one. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction accuracy. Our code and data are available at https://github.com/cobang0111/SPL.
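The "mirroring property" of a swap annotator can be illustrated with a toy Bradley-Terry-style model. The sketch below is a hedged illustration, not the paper's implementation: the function names (`pref_prob`, `swap_regularizer`) and the scalar latent `z` are hypothetical simplifications, assuming only that flipping every preference label of a user corresponds to negating the effective reward difference, so the swapped annotator's latent should mirror the original.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pref_prob(reward_diff, z):
    # Probability that a user with latent z prefers response A over B,
    # given a base reward difference r = r(A) - r(B).
    # (Illustrative user-conditional Bradley-Terry form, not the paper's model.)
    return sigmoid(reward_diff + z)

def swap_regularizer(z_user, z_swap):
    # One plausible form of swap-guided regularization: encourage the
    # encoder's latent for label-flipped data to mirror (negate) the
    # original user's latent.
    return (z_user + z_swap) ** 2

# Mirror property: the fictitious swapped annotator (negated reward
# difference and negated latent) prefers B over A exactly as often as
# the original user prefers A over B.
r, z = 0.8, 1.3
p_user = pref_prob(r, z)      # original user: P(A > B)
p_swap = pref_prob(-r, -z)    # swapped annotator: P(B > A)
assert abs(p_user + p_swap - 1.0) < 1e-12
assert swap_regularizer(z, -z) == 0.0
```

Under this simplification, the swap regularizer is zero exactly when the encoder assigns mirrored latents to a user and their swapped counterpart, which gives the encoder a data-driven reason to keep the latent informative rather than collapsing to the prior.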