RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders

📅 2025-08-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM-based conversational recommender systems (CRS) suffer from insufficient preference alignment because they cannot effectively model implicit user feedback such as dwell time and sentiment polarity. To address this, we propose a Reinforcement Learning from Human Feedback (RLHF) framework that leverages weakly labeled interaction data. Our method introduces a reward model $R_\phi$ that integrates multiple sources of implicit signals, jointly optimized with dialogue state transition modeling via Proximal Policy Optimization (PPO) in an end-to-end manner. Crucially, we incorporate fine-grained behavioral signals into reward modeling, enabling continuous, annotation-free preference alignment. Experiments on real-world benchmarks (REDIAL, OpenDialKG) and synthetic scenarios demonstrate significant improvements in top-$k$ recommendation accuracy, dialogue coherence, and user satisfaction. This work establishes a novel, scalable, and adaptive paradigm for CRS that bridges implicit behavioral cues with preference learning without relying on costly explicit annotations.
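The summary describes a reward model $R_\phi$ that fuses several implicit signals into a single scalar reward. The paper does not publish its architecture, so the following is a minimal PyTorch sketch of one plausible design; the feature choices (`dwell time`, `sentiment`), dimensions, and hidden size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImplicitRewardModel(nn.Module):
    """Sketch of a multi-source reward model R_phi.

    Fuses a dialogue-state embedding with implicit behavioral signals
    (dwell time, sentiment polarity) into a scalar reward. Feature
    choices and dimensions are illustrative assumptions, not the
    paper's published architecture.
    """

    def __init__(self, state_dim: int = 768, signal_dim: int = 2, hidden: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(state_dim + signal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar reward
        )

    def forward(self, state_emb: torch.Tensor, signals: torch.Tensor) -> torch.Tensor:
        # state_emb: (batch, state_dim) encoding of (dialogue history, action a_t)
        # signals:   (batch, signal_dim), e.g. [normalized dwell time, sentiment in [-1, 1]]
        return self.fuse(torch.cat([state_emb, signals], dim=-1)).squeeze(-1)
```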

📝 Abstract
Conversational recommender systems (CRS) based on Large Language Models (LLMs) must be continually aligned with user preferences to provide satisfying and context-relevant item recommendations. Traditional supervised fine-tuning cannot capture implicit feedback signals, e.g., dwell time, sentiment polarity, or engagement patterns. In this paper, we present a fine-tuning solution based on reinforcement learning from human feedback (RLHF) that optimizes for implicit user feedback (IUF) in a multi-turn recommendation context. We specify a reward model $R_\phi$ learned from weakly labelled engagement data and optimize the foundational LLM $M_\theta$ with proximal policy optimization (PPO) to maximize user-centric utility. The architecture models conversational state transitions $s_t \to a_t \to s_{t+1}$, where the action $a_t$ corresponds to LLM-generated item suggestions conditioned on the preceding conversation history. Evaluation across synthetic and real-world datasets (e.g., REDIAL, OpenDialKG) demonstrates that our RLHF-fine-tuned models outperform existing baselines in top-$k$ recommendation accuracy, coherence, and user satisfaction. This paper shows that implicit-signal alignment offers an efficient path toward scalable and user-adaptive CRS design.
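For readers unfamiliar with PPO, the core update the abstract refers to is the clipped surrogate objective over generated recommendation turns. Below is a minimal, self-contained PyTorch sketch of that loss; the variable names and the clip range of 0.2 are standard PPO conventions, not values reported by the paper.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate loss (to be minimized).

    logp_new / logp_old: log-probabilities of action a_t (the generated
    recommendation turn) under the current and pre-update policy M_theta.
    advantages: advantage estimates derived from the reward model R_phi.
    clip_eps is the usual PPO clip range; 0.2 is a common default, not
    a value reported by the paper.
    """
    ratio = torch.exp(logp_new - logp_old)          # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # maximize surrogate = minimize negative
```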
Problem

Research questions and friction points this paper is trying to address.

Align LLMs with implicit user feedback in conversational recommenders
Capture implicit signals such as dwell time and engagement patterns (see the sketch after this list)
Optimize LLM using RLHF for better recommendation accuracy and satisfaction
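The paper states only that engagement data is weakly labelled, not the exact labeling function. The sketch below shows one plausible way to map raw interaction signals to a weak preference score; the `Turn` schema and the weighting scheme are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One recommendation turn from an interaction log (illustrative schema)."""
    dwell_seconds: float   # time the user spent on the recommended item
    sentiment: float       # sentiment polarity of the user's reply, in [-1, 1]
    clicked: bool          # whether the user clicked the recommendation

def weak_engagement_label(turn: Turn, max_dwell: float = 60.0) -> float:
    """Map implicit signals to a weak preference score in [0, 1].

    The weighting scheme is an illustrative assumption; the paper only
    says the reward model is trained on weakly labelled engagement data.
    """
    dwell = min(turn.dwell_seconds, max_dwell) / max_dwell  # normalize dwell time
    sentiment = (turn.sentiment + 1.0) / 2.0                # shift polarity to [0, 1]
    click = 1.0 if turn.clicked else 0.0
    return 0.4 * dwell + 0.4 * sentiment + 0.2 * click

# Example: a 30-second dwell with mildly positive sentiment and a click -> 0.7
print(weak_engagement_label(Turn(dwell_seconds=30.0, sentiment=0.5, clicked=True)))
```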
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLHF fine-tuning for implicit user feedback alignment
Reward model learning from weakly-labelled engagement data (see the sketch after this list)
PPO-optimized LLM for multi-turn recommendation accuracy
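One common way to train a reward model from weak preference labels is a Bradley-Terry-style pairwise loss, where turns are paired by their weak engagement scores. Pairing by weak labels is an assumption about the training setup; the paper states only that $R_\phi$ is learned from weakly labelled data.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_preferred: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss for training R_phi.

    r_preferred / r_rejected: scalar rewards for the turn with the
    higher vs. lower weak engagement score. Pairing by weak labels is
    an assumed training setup, not one confirmed by the paper.
    """
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Example usage with dummy reward scores from the reward model
loss = pairwise_reward_loss(torch.tensor([1.2, 0.7]), torch.tensor([0.3, 0.9]))
print(loss.item())
```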
Zhongheng Yang
Khoury College of Computer Sciences, Northeastern University, Jersey City, NJ, USA

Yinuo Yang
McCormick School of Engineering, Northwestern University, Evanston, IL, USA

Aijia Sun
Khoury College of Computer Sciences, Northeastern University, Seattle, WA, USA

Dannier Li
School of Computing, University of Nebraska-Lincoln, Lincoln, NE, USA

Yushang Zhao
Washington University in St. Louis

Chengrui Zhou
Columbia University

Tags: Artificial Intelligence, NLP, LLM, Recommendation, Digital Marketing