🤖 AI Summary
Large language models (LLMs) face challenges in task-oriented dialogue (TOD), particularly in multi-step decision-making and effective external knowledge invocation.
Method: This work introduces the ReAct paradigm to TOD for the first time, systematically integrating reasoning–action co-prompting, simulated environment evaluation, real-user testing, and multi-dimensional human evaluation.
Contribution/Results: ReAct-enhanced LLMs significantly improve subjective user satisfaction, dialogue naturalness, and robustness in real-world interactions. Although task completion rates remain slightly below current state-of-the-art (SOTA) systems, our findings reveal a fundamental trade-off between objective success metrics and holistic user experience. This challenges purely metric-driven evaluation paradigms and establishes a human-centered framework for both model design and assessment in TOD, paving the way for more usable, trustworthy, and user-aligned conversational agents.
📝 Abstract
Large language models (LLMs) have gained immense popularity due to their impressive capabilities in unstructured conversation. However, they underperform previous approaches in task-oriented dialogue (TOD), where reasoning and access to external information are crucial. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) has shown promise for solving complex tasks that traditionally required reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs in performing TOD. We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs appear to underperform state-of-the-art approaches in simulation, human evaluation indicates a higher user satisfaction rate than handcrafted systems, despite a lower success rate.
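To make the ReAct idea concrete, here is a minimal sketch of a reasoning–action loop for a TOD turn. This is not the paper's actual system: the `scripted_llm` stand-in, the `lookup_restaurants` tool, and the `RESTAURANT_DB` knowledge base are all hypothetical, illustrating only the interleaved Thought/Action/Observation pattern by which an LLM would invoke external knowledge.

```python
# Sketch of a ReAct-style loop for task-oriented dialogue.
# The LLM is stubbed with a scripted policy; in practice a real model
# would be prompted to emit interleaved Thought/Action/Answer lines.

# Hypothetical external knowledge base queried by the Action step.
RESTAURANT_DB = {"centre": ["Curry Garden", "Pizza Hut City Centre"]}

def lookup_restaurants(area: str) -> str:
    """Tool invoked by the agent's Action step (assumed interface)."""
    names = RESTAURANT_DB.get(area, [])
    return ", ".join(names) if names else "no match"

def scripted_llm(trace: str) -> str:
    """Stand-in for an LLM: emits the next Thought/Action or Answer."""
    if "Observation:" in trace:
        return "Answer: I found Curry Garden in the centre."
    return ("Thought: The user wants a restaurant in the centre.\n"
            "Action: lookup[centre]")

def react_turn(user_utterance: str, max_steps: int = 4) -> str:
    """Run one dialogue turn: alternate model steps and tool calls."""
    trace = f"User: {user_utterance}"
    for _ in range(max_steps):
        step = scripted_llm(trace)
        trace += "\n" + step
        if step.startswith("Answer:"):
            return step[len("Answer: "):]
        if "Action: lookup[" in step:
            area = step.split("lookup[")[1].split("]")[0]
            trace += f"\nObservation: {lookup_restaurants(area)}"
    return "Sorry, I could not complete the request."
```

The key point is the loop structure: each model step either reasons, calls a tool whose result is fed back as an Observation, or commits to a final answer for the user.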