COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

📅 2025-10-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing evaluations lack rigorous assessment of large language model (LLM) agents' ability to jointly perform multi-turn tool orchestration and user preference optimization for complex constrained planning tasks. Method: We construct a realistic tool ecosystem comprising verified transportation, accommodation, and ticketing databases covering 20 U.S. National Parks, integrated with a simulated commercial booking platform. We propose the first multi-turn constrained optimization and preference coordination benchmark tailored to travel planning, unifying dialogue modeling, tool invocation, hard-constraint satisfaction, and soft-preference optimization. Results: Experiments reveal that mainstream LLMs reliably satisfy hard constraints but exhibit significant limitations in soft-preference optimization (e.g., time–cost trade-offs) and cross-service coordinated planning; open-source models underperform further. Our framework quantifies, for the first time, the systematic gap between feasible solutions and Pareto-optimal solutions, establishing a novel benchmark for evaluating planning robustness in LLM agents.
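To make the distinction between feasible and Pareto-optimal solutions concrete, here is a minimal Python sketch of a time–cost Pareto front. The `Itinerary` type, its fields, and the sample options are hypothetical illustrations, not data or code from the benchmark.

```python
from typing import NamedTuple


class Itinerary(NamedTuple):
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    hours: float  # total travel time
    cost: float   # total price in USD


def pareto_front(options: list[Itinerary]) -> list[Itinerary]:
    """Return the options not dominated on (hours, cost); lower is better."""
    front = []
    for a in options:
        # b dominates a if it is no worse on both axes and strictly
        # better on at least one of them.
        dominated = any(
            b.hours <= a.hours
            and b.cost <= a.cost
            and (b.hours < a.hours or b.cost < a.cost)
            for b in options
        )
        if not dominated:
            front.append(a)
    return front


# Every option below is assumed to already satisfy the hard constraints.
feasible = [
    Itinerary("direct", 5.0, 620.0),
    Itinerary("one_stop", 8.5, 410.0),
    Itinerary("slow_pricey", 9.0, 640.0),
    Itinerary("red_eye", 8.5, 450.0),
]
front_names = [o.name for o in pareto_front(feasible)]
print(front_names)  # ['direct', 'one_stop']
```

In this toy example all four itineraries are feasible, but only two are Pareto-optimal. An agent that returns `red_eye` has produced a valid plan that is strictly worse than `one_stop`.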

📝 Abstract
Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
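The acceptable-optimal gap can be illustrated with a small grading sketch: a choice is "acceptable" if it meets every hard constraint and "optimal" only if it also maximizes the soft preference. The `HotelOption` fields, the budget and pet constraints, and the use of rating as the preference are all assumptions made for illustration; the benchmark's actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HotelOption:
    # Hypothetical schema, not the benchmark's database format.
    name: str
    price_per_night: float
    rating: float
    pet_friendly: bool


def is_acceptable(opt: HotelOption, budget: float, needs_pets: bool) -> bool:
    """Hard constraints: within budget, and pet-friendly if required."""
    return opt.price_per_night <= budget and (opt.pet_friendly or not needs_pets)


def best_option(options, budget, needs_pets):
    """Assumed soft preference: highest rating among acceptable options."""
    acceptable = [o for o in options if is_acceptable(o, budget, needs_pets)]
    return max(acceptable, key=lambda o: o.rating) if acceptable else None


def grade(choice, options, budget, needs_pets):
    """Label a choice drawn from `options` as infeasible, acceptable, or optimal."""
    if not is_acceptable(choice, budget, needs_pets):
        return "infeasible"
    best = best_option(options, budget, needs_pets)
    return "optimal" if choice.rating >= best.rating else "acceptable"


hotels = [
    HotelOption("lodge", 180.0, 4.6, True),
    HotelOption("motel", 90.0, 3.8, True),
    HotelOption("resort", 320.0, 4.9, False),
]
labels = {h.name: grade(h, hotels, budget=200.0, needs_pets=True) for h in hotels}
print(labels)  # {'lodge': 'optimal', 'motel': 'acceptable', 'resort': 'infeasible'}
```

Under these assumptions, an agent that books `motel` passes a constraint-only check yet still falls short of the best available option, which is the kind of shortfall the abstract describes.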

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' multi-turn tool use for complex planning tasks
Optimizing user preferences while satisfying hard constraints in travel planning
Identifying performance gaps in preference optimization and multi-service coordination
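As a rough illustration of what multi-service coordination involves, the sketch below checks whether a hotel stay covers the nights implied by a round-trip flight. The `RoundTrip` and `HotelStay` types and the coverage rule are simplifying assumptions, not the benchmark's actual validation logic.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RoundTrip:
    # Hypothetical fields for illustration.
    arrive: datetime       # outbound arrival at the destination
    depart_home: datetime  # departure of the return flight


@dataclass(frozen=True)
class HotelStay:
    check_in: date
    check_out: date


def is_coordinated(trip: RoundTrip, stay: HotelStay) -> bool:
    """True if the stay starts by the arrival date and lasts until the return date."""
    return (
        stay.check_in < stay.check_out
        and stay.check_in <= trip.arrive.date()
        and stay.check_out >= trip.depart_home.date()
    )


trip = RoundTrip(
    arrive=datetime(2025, 6, 10, 15, 30),
    depart_home=datetime(2025, 6, 14, 9, 0),
)
covered = HotelStay(date(2025, 6, 10), date(2025, 6, 14))
late_check_in = HotelStay(date(2025, 6, 11), date(2025, 6, 14))

print(is_coordinated(trip, covered))        # True
print(is_coordinated(trip, late_check_in))  # False
```

Each booking here can be valid on its own while the combination still fails, because the traveler in the second case has no room on the first night.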

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn tool-mediated planning benchmark
Constrained preference optimization problem formulation
Realistic travel database and tool ecosystem
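One generic way to write a constrained preference optimization problem is shown below. The notation is an assumption for illustration and may not match the paper's own: $\mathcal{X}$ is the set of plans reachable through the tools, $c_i$ is an indicator for the $i$-th hard constraint, and $u$ is a score for the user's soft preferences.

```latex
\[
  x^{*} \in \arg\max_{x \in \mathcal{X}} \; u(x)
  \quad \text{subject to} \quad
  c_i(x) = 1 \quad \text{for all } i = 1, \dots, m .
\]
```

In this reading, a plan $\hat{x}$ is acceptable when every $c_i(\hat{x}) = 1$ and optimal when it also attains $u(\hat{x}) = u(x^{*})$. The difference $u(x^{*}) - u(\hat{x}) \ge 0$ then measures how far an acceptable plan falls short.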