🤖 AI Summary
In conversational systems, generative query suggestion struggles to capture users' fine-grained preferences precisely. To address this, we propose a multi-stage alignment framework: (1) initial intent alignment via prompt engineering and supervised fine-tuning, where knowledge distillation over click logs builds a strong foundation model; (2) subsequent preference alignment using a Gaussian Reward Model (GaRM), which represents user preferences as probability distributions rather than point estimates. During reinforcement learning, GaRM is combined with auxiliary heuristic rewards into a composite reward to mitigate reward hacking, while two-stage reward fusion and out-of-distribution regularization stabilize training. Extensive evaluation, covering automatic metrics, human assessment, and online A/B testing, demonstrates significant improvements, including a 34% relative lift in click-through rate over strong baselines. The core contribution is GaRM and its associated training paradigm, the first to explicitly model user preferences as uncertainty-aware distributions while ensuring robust policy optimization under sparse, noisy feedback.
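The summary does not give GaRM's training objective. One plausible sketch, under the assumption that the model outputs a mean and standard deviation per suggestion and is fit to scalar click-derived preference targets, is a Gaussian negative log-likelihood; the function below is illustrative, not the paper's formula:

```python
import math

def gaussian_nll(mu: float, sigma: float, y: float) -> float:
    """Negative log-likelihood of target y under N(mu, sigma^2).

    Minimizing this encourages calibrated variance: a prediction that
    is wrong AND overconfident (small sigma) is penalized far more than
    one that is wrong but appropriately uncertain. This is one way a
    distributional reward model could express preference uncertainty
    instead of a single point score.
    """
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# Same prediction error, different claimed certainty:
overconfident = gaussian_nll(mu=0.9, sigma=0.05, y=0.2)  # tiny sigma, big miss
cautious = gaussian_nll(mu=0.9, sigma=0.5, y=0.2)        # wide sigma, same miss
```

Here the miss costs the overconfident head much more, which is the calibration pressure a distribution-valued reward model relies on.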
📝 Abstract
Generative query suggestion with large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by a supervised fine-tuning stage in which a distillation method over click logs produces a robust foundational model. To model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is further supported by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines in both automatic and human evaluations and yields a 34% relative increase in user engagement, measured by click-through rate, in live A/B tests.
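The abstract does not specify how the GaRM score and the auxiliary heuristics are combined. As one hypothetical instantiation, a distributional reward can be collapsed into an uncertainty-penalized scalar (a lower-confidence-bound style score) and then linearly blended with a heuristic term; the weights `alpha` and `beta` below are assumed hyperparameters, not values from the paper:

```python
def garm_score(mu: float, sigma: float, beta: float = 1.0) -> float:
    """Collapse a Gaussian reward N(mu, sigma^2) into a scalar.

    Subtracting beta * sigma penalizes suggestions the reward model is
    unsure about, a common guard against the policy exploiting noisy
    high-mean regions of the reward model (one form of reward hacking).
    """
    return mu - beta * sigma

def fused_reward(mu: float, sigma: float, heuristic: float,
                 alpha: float = 0.7, beta: float = 1.0) -> float:
    """Illustrative fusion of the GaRM score with an auxiliary heuristic
    (e.g. a relevance or formatting check), weighted by alpha."""
    return alpha * garm_score(mu, sigma, beta) + (1 - alpha) * heuristic

# High-mean but uncertain suggestion vs. a slightly lower-mean, confident one:
risky = fused_reward(mu=0.9, sigma=0.4, heuristic=0.5)
safe = fused_reward(mu=0.8, sigma=0.1, heuristic=0.5)
```

With these weights the confident suggestion wins, which is the intended effect of folding variance into the policy's reward signal.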