Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM-based negotiation agents for the online travel agency (OTA) business face three key challenges: insufficient persuasive capability, weak adherence to standard operating procedures (SOPs), and difficulty balancing colloquial understanding with operational constraints such as hallucination prevention and over-commitment avoidance. To address them, this paper proposes the Reward-Enhanced Policy Optimization (REPO) framework. REPO aligns the policy with heterogeneous reward signals: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic guardrail checks, combined through an enhancement mechanism that suppresses reward hacking. Trained via supervised fine-tuning followed by reinforcement learning, REPO lifts the average dialogue rating on real business development dialogues to 4.63 (+0.33 over GRPO, the strongest baseline), raises the share of conversations with at least one excellent response to 66.67%, achieves a 93.33% bad-case fix rate, and exhibits emergent capabilities in proactive empathy and localized reasoning.

📝 Abstract
We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts average dialogue rating to 4.63 (+1.20 over base, +0.83 over Direct Preference Optimization (DPO), and +0.33 over Group Relative Policy Optimization (GRPO)), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.
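The abstract describes the enhancement mechanism only at a high level. Below is a minimal sketch, in Python, of one way the three signals could be adjudicated per sampled response, assuming a simple hard gate in which any RF violation overrides the learned rewards; the weights, gating rule, and all names here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a REPO-style heterogeneous reward combination.
# Assumption: programmatic checks (RF) act as hard guardrails, the reward
# judge (RJ) scores persuasive behavior / SOP compliance, and the reward
# model (RM) supplies a dense preference-aligned score.

from dataclasses import dataclass

@dataclass
class RewardSignals:
    rm_score: float   # dense preference-trained reward model score, in [0, 1]
    rj_score: float   # reward-judge score for persuasion / SOP compliance, in [0, 1]
    rf_passed: bool   # programmatic checks: numerics, formatting, guardrails

def enhanced_reward(sig: RewardSignals,
                    w_rm: float = 0.6,
                    w_rj: float = 0.4,
                    violation_penalty: float = -1.0) -> float:
    """Combine heterogeneous signals into one scalar for RL post-training.

    Hypothetical adjudication: any deterministic guardrail violation
    overrides the learned rewards, so the policy cannot game the RM by
    producing fluent but non-compliant responses.
    """
    if not sig.rf_passed:
        return violation_penalty  # hard gate on verifiable constraints
    return w_rm * sig.rm_score + w_rj * sig.rj_score

# Example: a fluent response that violates a guardrail gets the penalty.
print(enhanced_reward(RewardSignals(rm_score=0.9, rj_score=0.8, rf_passed=False)))  # -1.0
print(enhanced_reward(RewardSignals(rm_score=0.9, rj_score=0.8, rf_passed=True)))   # ~0.86
```

The intuition behind such a gate: fluent but non-compliant responses can still score highly under a learned RM, so letting deterministic checks veto the learned signal removes the incentive to exploit it.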
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs for persuasive price negotiation in online travel agencies
Enforcing business constraints while maintaining nuanced persuasive style
Overcoming overfitting from conventional single-reward optimization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-Enhanced Policy Optimization combines heterogeneous rewards
Preference-trained reward model aligns with dense human feedback
Programmatic reward functions enforce deterministic checks on guardrails (a sketch follows this list)
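To make the last point concrete, here is a minimal sketch of what a programmatic reward function for deterministic guardrail checks could look like; the regex, phrase list, and thresholds are hypothetical placeholders, not the paper's actual business rules.

```python
import re

# Hypothetical programmatic reward function (RF) sketch: deterministic,
# verifiable checks on a candidate negotiation response. The concrete
# rules below are illustrative assumptions, not the paper's checks.

OVERPROMISE_PHRASES = ("guarantee", "promise you", "definitely refund")

def rf_check(response: str, floor_price: float) -> bool:
    """Return True iff the response passes all deterministic guardrails."""
    # 1. Numeric check: any price mentioned must not undercut the floor.
    for match in re.findall(r"\$?(\d+(?:\.\d{1,2})?)", response):
        if float(match) < floor_price:
            return False  # offered a price below the allowed floor
    # 2. Guardrail check: no over-promising language.
    lowered = response.lower()
    if any(phrase in lowered for phrase in OVERPROMISE_PHRASES):
        return False
    # 3. Formatting check: response must be non-empty and reasonably short.
    return 0 < len(response) <= 600

# Example usage with hypothetical values.
print(rf_check("We can offer $95 per night for your dates.", 90.0))  # True
print(rf_check("I guarantee $80 per night.", 90.0))                  # False
```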
Zhuoran Zhuang
Fliggy Alibaba
Ye Chen
Fliggy Alibaba
Xia Zeng
Fliggy Alibaba
Chao Luo
Shijiazhuang Tiedao University
Luhui Liu
Fliggy Alibaba
Yihan Chen
Fliggy Alibaba