ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

📅 2025-08-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing e-commerce benchmarks primarily evaluate simple user intents (e.g., finding or purchasing a product) and fail to assess language agents’ ability to handle complex, real-world shopping goals such as applying vouchers, staying within a budget, and finding a seller that offers multiple products. Method: We introduce ShoppingBench, an end-to-end evaluation benchmark for complex shopping intents, featuring an interactive sandbox environment with over 2.5 million real-world products and a scalable, intent-driven instruction generation framework. To transfer capabilities from a large agent to a smaller one, we propose trajectory distillation, which combines supervised fine-tuning with reinforcement learning on synthetic interaction trajectories. Contribution/Results: Experiments show that even GPT-4.1 achieves an absolute success rate below 50% on ShoppingBench, which highlights the benchmark’s difficulty. The distilled smaller agent achieves performance competitive with GPT-4.1, indicating effective capability transfer.

📝 Abstract

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy that uses supervised fine-tuning, followed by reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves performance competitive with GPT-4.1.
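
To make the kind of intent described in the abstract concrete, the following is a minimal sketch of how a budget, voucher, and single-seller goal might be checked and aggregated into a success rate. The abstract does not specify the benchmark's data model or scoring rules, so every name here (`Item`, `Task`, `is_success`, `success_rate`) and the all-constraints-must-hold criterion are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    price: float
    seller: str


@dataclass
class Task:
    # Hypothetical constraint fields; not taken from the paper.
    budget: float             # maximum amount the user will pay
    voucher_threshold: float  # minimum subtotal needed to use the voucher
    voucher_discount: float   # flat discount once the threshold is met
    same_seller: bool         # whether all items must come from one seller


def total_after_voucher(items: List[Item], task: Task) -> float:
    """Subtotal of the cart, minus the voucher if the threshold is reached."""
    subtotal = sum(item.price for item in items)
    if subtotal >= task.voucher_threshold:
        subtotal -= task.voucher_discount
    return subtotal


def is_success(items: List[Item], task: Task) -> bool:
    """Assumed criterion: a task succeeds only if every constraint holds."""
    if not items:
        return False
    if task.same_seller and len({item.seller for item in items}) != 1:
        return False
    return total_after_voucher(items, task) <= task.budget


def success_rate(results: List[Tuple[List[Item], Task]]) -> float:
    """Fraction of (cart, task) pairs that satisfy all constraints."""
    if not results:
        return 0.0
    return sum(is_success(items, task) for items, task in results) / len(results)
```

Under this assumed all-or-nothing criterion, a cart that fits the budget but mixes sellers still counts as a failure, which is one reason a stricter success metric can sit well below 50% even for strong agents.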

Problem

Research questions and friction points this paper is trying to address.

Addresses complex real-world shopping goals beyond basic intents

Evaluates LLM agents in a realistic e-commerce simulation environment

Improves small agent performance via distillation and reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable framework for simulating user instructions

Large-scale shopping sandbox with real products

Trajectory distillation with fine-tuning and RL
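
As an illustration of the data-preparation side of trajectory distillation, the sketch below turns teacher trajectories into supervised fine-tuning examples. The summary does not describe the paper's actual pipeline, so the success-only filter, the `Step` and `Trajectory` types, and the prompt/completion layout are all assumptions; the reinforcement learning stage is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    observation: str  # what the sandbox returned to the agent
    action: str       # what the teacher agent did next


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step]
    success: bool     # whether the teacher completed the task


def to_sft_examples(trajectories: List[Trajectory]) -> List[Dict[str, str]]:
    """Keep successful trajectories and emit one example per teacher action.

    Each prompt is the instruction plus the interaction history up to and
    including the current observation; the completion is the teacher action.
    """
    examples: List[Dict[str, str]] = []
    for traj in trajectories:
        if not traj.success:
            continue  # assumed filter: drop failed teacher rollouts
        history = [traj.instruction]
        for step in traj.steps:
            history.append(step.observation)
            examples.append(
                {"prompt": "\n".join(history), "completion": step.action}
            )
            history.append(step.action)
    return examples
```

Splitting each trajectory into per-step examples lets a smaller model learn the teacher's next action given the full history, which is one common way to set up imitation on multi-turn agent data.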
