From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
In conversational systems, generative query suggestion struggles to precisely capture users’ fine-grained preferences. To address this, we propose a multi-stage alignment framework: (1) initial intent alignment via prompt engineering and supervised fine-tuning; (2) subsequent preference alignment using a Gaussian Reward Model (GaRM), which explicitly represents user preferences as probabilistic distributions. GaRM incorporates two-stage reward fusion and out-of-distribution regularization to mitigate reward hacking and enhance RL stability. Our method integrates knowledge distillation, composite reward design, and click-log-driven preference modeling. Extensive evaluation—including automated metrics, human assessment, and online A/B testing—demonstrates significant improvements: a 34% relative lift in click-through rate over strong baselines. The core contribution is GaRM and its associated training paradigm—the first to explicitly model user preferences as uncertainty-aware distributions while ensuring robust policy optimization under sparse, noisy feedback.

📝 Abstract
Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.
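The abstract's central idea, a reward model that outputs a preference *distribution* rather than a scalar, can be illustrated with a minimal sketch. This is not the paper's implementation; the head architecture, feature dimension, and log-variance parameterization are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianRewardHead:
    """Maps query-suggestion features to a reward distribution N(mu, sigma^2)
    instead of a point estimate (a hypothetical sketch of GaRM's core idea)."""
    def __init__(self, dim):
        self.w_mu = rng.normal(scale=0.1, size=dim)
        self.w_logvar = rng.normal(scale=0.1, size=dim)

    def forward(self, x):
        mu = x @ self.w_mu
        log_var = x @ self.w_logvar        # predict log-variance for numerical stability
        return mu, np.exp(0.5 * log_var)   # (mean reward, uncertainty)

def gaussian_nll(mu, sigma, r):
    """Negative log-likelihood of observed click feedback r under N(mu, sigma^2):
    uncertain predictions are penalized less for being wrong, but pay a
    log-variance cost, so the model learns calibrated uncertainty."""
    return 0.5 * (np.log(2 * np.pi * sigma**2) + (r - mu) ** 2 / sigma**2)

head = GaussianRewardHead(dim=4)
x = rng.normal(size=4)                # toy feature vector for one suggestion
mu, sigma = head.forward(x)
loss = gaussian_nll(mu, sigma, r=1.0)  # r=1.0: the suggestion was clicked
```

Training on click logs would minimize this NLL, so noisy or sparse feedback naturally inflates `sigma` instead of corrupting the mean estimate.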
Problem

Research questions and friction points this paper is trying to address.

Aligning generative query suggestions with nuanced user preferences
Modeling user preference uncertainty in conversational systems
Mitigating reward hacking while maintaining training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage alignment framework for query suggestion
Gaussian Reward Model for preference uncertainty
Reinforcement learning with composite reward function
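The composite-reward idea above can be sketched as follows. The paper does not specify its fusion formula; the lower-confidence-bound discount, the equal blend weight, and the heuristic weighting here are all illustrative assumptions.

```python
def fuse_reward(garm_mu, garm_sigma, heuristics, beta=0.5, weights=None):
    """Hypothetical two-stage reward fusion:
    stage 1 -- discount GaRM's mean by its predicted uncertainty (LCB-style),
    so the policy cannot exploit rewards the model is unsure about;
    stage 2 -- blend with auxiliary heuristic scores to curb reward hacking."""
    base = garm_mu - beta * garm_sigma                        # uncertainty-aware base reward
    weights = weights or [1.0 / len(heuristics)] * len(heuristics)
    aux = sum(w * h for w, h in zip(weights, heuristics))     # heuristic aggregate
    return 0.5 * base + 0.5 * aux

# Toy call: a confident GaRM estimate blended with two heuristic scores
r = fuse_reward(garm_mu=0.8, garm_sigma=0.2, heuristics=[0.9, 0.6])
print(r)  # → 0.725
```

Subtracting a multiple of `sigma` is one standard way to make an RL policy conservative under reward-model uncertainty; the paper's actual mechanism (including its out-of-distribution regularization) may differ.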