🤖 AI Summary
This work addresses a limitation of existing task-oriented dialogue systems, which rely on token-level optimization and struggle to align with long-horizon task objectives. The authors propose GOPO, a novel framework that, for the first time in task-oriented dialogue, decouples strategy planning from response generation. An Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level using sequence-level reward modeling, while a Customer Service Agent generates responses guided by the learned strategy. Integrating hierarchical reinforcement learning, goal-oriented preference optimization, and large language model fine-tuning, GOPO establishes a new paradigm tailored for commercial applications. On the Mgshop dataset, it improves Task-focused Sequential Engagement (TSE) by 7.7% and 10.3% over PPO and Memento, respectively; notably, its 14B-parameter variant outperforms both Qwen-235B and GPT-5.2 and demonstrates consistent gains across multiple datasets.
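The decoupled two-agent loop described above can be sketched as follows. The paper's actual agent implementations are LLM-based and are not specified here, so this is only an illustrative skeleton: `expert_agent`, `service_agent`, and `run_turn` are hypothetical names, and the strategy-selection rule is a trivial stand-in for the learned trajectory-level planner.

```python
# Illustrative sketch (not the paper's implementation) of decoupling
# strategy planning (Expert Agent) from response generation
# (Customer Service Agent) in a single dialogue turn.

def expert_agent(dialogue_history):
    # Trajectory-level planner: selects a high-level strategy for the next
    # turn. A real implementation would score candidate strategies with a
    # learned sequence-level reward model; this rule is purely illustrative.
    return "clarify_issue" if not dialogue_history else "propose_solution"

def service_agent(dialogue_history, strategy):
    # Response generator constrained to follow the selected strategy.
    # A real implementation would condition an LLM on the strategy label.
    return f"[{strategy}] How can I help with your order?"

def run_turn(dialogue_history, user_utterance):
    # One turn of the hierarchical loop: plan first, then generate.
    history = dialogue_history + [("user", user_utterance)]
    strategy = expert_agent(dialogue_history)
    reply = service_agent(history, strategy)
    return history + [("assistant", reply)], strategy
```

The point of the split is that the planner can be optimized against long-horizon, trajectory-level objectives while the generator is only responsible for faithfully realizing the chosen strategy in text.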
📝 Abstract
Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which aligns poorly with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, respectively, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization, and GOPO's improvements hold across the remaining datasets. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios; code and datasets will be made public.
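The abstract contrasts token-level preference optimization with GOPO's trajectory-level goal preferences. The paper's exact objective is not given here, so the sketch below is a minimal, assumed illustration of what "lifting" preference optimization to the sequence level can look like: a DPO-style loss computed over whole-dialogue log-probabilities rather than per-token ones. Function names and the `beta` parameter are hypothetical.

```python
import math

def trajectory_logprob(token_logprobs):
    # Sequence-level score: sum of token log-probs over every turn of the
    # dialogue trajectory, so preferences compare whole trajectories.
    return sum(token_logprobs)

def goal_preference_loss(chosen_lp, rejected_lp,
                         ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Assumed DPO-style preference loss at the trajectory level: prefer the
    # trajectory that better serves the long-horizon goal, regularized by a
    # frozen reference policy's log-probs (ref_*). beta scales the margin.
    margin = beta * ((chosen_lp - ref_chosen_lp)
                     - (rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)): small when the chosen trajectory is clearly
    # preferred, large when the rejected one scores higher.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Under this framing, the gradient signal rewards entire dialogues that reach the task goal, rather than individual tokens, which is the mismatch the abstract attributes to token-level training.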