🤖 AI Summary
This work addresses a key limitation of existing reinforcement learning from human feedback (RLHF) approaches, which typically assume a single universal reward and therefore struggle to capture the diversity of user preferences. It further identifies posterior collapse in variational preference learning (VPL) under sparse preference data, which renders the user-specific latent variables ineffective. To resolve this issue, the paper proposes Swap-guided Preference Learning (SPL), the first method explicitly designed to mitigate posterior collapse in preference modeling. SPL constructs fictitious "swapped annotators" and exploits the mirroring property of their preferences, combining swap-guided regularization, a Preferential Inverse Autoregressive Flow (P-IAF), and an adaptive latent conditioning mechanism to encourage the encoder to make use of the latent variables. Experimental results demonstrate that SPL significantly alleviates posterior collapse, enriches the latent representations, and improves the accuracy of personalized preference prediction.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we find that VPL suffers from posterior collapse: while this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, the latent variables in VPL can be ignored entirely, reverting the model to a single-reward one. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction accuracy. Our code and data are available at https://github.com/cobang0111/SPL.
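The "mirroring property" of a swap annotator can be illustrated with a toy Bradley-Terry-style model. The sketch below is a hedged illustration, not the paper's implementation: the function names (`pref_prob`, `swap_regularizer`) and the scalar latent `z` are hypothetical simplifications, assuming only that flipping every preference label of a user corresponds to negating the effective reward difference, so the swapped annotator's latent should mirror the original.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pref_prob(reward_diff, z):
    # Probability that a user with latent z prefers response A over B,
    # given a base reward difference r = r(A) - r(B).
    # (Illustrative user-conditional Bradley-Terry form, not the paper's model.)
    return sigmoid(reward_diff + z)

def swap_regularizer(z_user, z_swap):
    # One plausible form of swap-guided regularization: encourage the
    # encoder's latent for label-flipped data to mirror (negate) the
    # original user's latent.
    return (z_user + z_swap) ** 2

# Mirror property: the fictitious swapped annotator (negated reward
# difference and negated latent) prefers B over A exactly as often as
# the original user prefers A over B.
r, z = 0.8, 1.3
p_user = pref_prob(r, z)      # original user: P(A > B)
p_swap = pref_prob(-r, -z)    # swapped annotator: P(B > A)
assert abs(p_user + p_swap - 1.0) < 1e-12
assert swap_regularizer(z, -z) == 0.0
```

Under this simplification, the swap regularizer is zero exactly when the encoder assigns mirrored latents to a user and their swapped counterpart, which gives the encoder a data-driven reason to keep the latent informative rather than collapsing to the prior.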