ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

📅 2025-08-06
🤖 AI Summary
Problem: Existing e-commerce benchmarks primarily evaluate basic user intents (e.g., finding or purchasing a product) and fail to assess language agents on complex, real-world shopping goals such as applying vouchers, staying within a budget, and finding a single seller that offers several products. Method: We introduce ShoppingBench, an end-to-end benchmark for complex shopping intents, featuring an interactive sandbox environment with over 2.5 million real-world products and a scalable, intent-driven instruction generation framework. To transfer capabilities from a large agent to a smaller one, we propose trajectory distillation, which combines supervised fine-tuning with reinforcement learning on synthetic interaction trajectories. Contribution/Results: Even GPT-4.1 achieves an absolute success rate below 50% on ShoppingBench, which highlights the benchmark's difficulty. The distilled smaller agent achieves performance competitive with GPT-4.1. ShoppingBench thus fills a gap in modeling and evaluating complex user objectives in e-commerce AI systems.

📝 Abstract
Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding a single seller that offers multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy that combines supervised fine-tuning with reinforcement learning on synthetic trajectories to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves performance competitive with GPT-4.1.
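The abstract reports an absolute success rate but does not define it here. As a rough illustration only, the sketch below assumes that an episode counts as a success when every requirement of the instruction (for example product match, budget, voucher) is satisfied. The `EpisodeResult` type, the check names, and this all-or-nothing rule are assumptions made for the example, not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str
    # Hypothetical per-requirement outcomes, e.g. {"product": True, "budget": False}.
    checks: Dict[str, bool]


def absolute_success_rate(results: List[EpisodeResult]) -> float:
    """Fraction of episodes in which every recorded check passed.

    Assumption: an episode with no recorded checks counts as a failure.
    """
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.checks and all(r.checks.values()))
    return solved / len(results)
```

Under this assumed rule, an agent that finds the right product but exceeds the budget gets no credit for that task, which is one plausible reason complex intents score lower than simple search or purchase tasks.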
Problem

Research questions and friction points this paper is trying to address.

Addresses complex real-world shopping goals beyond basic intents
Evaluates LLM agents in a realistic e-commerce simulation environment
Improves small agent performance via distillation and reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable framework for simulating user instructions
Large-scale shopping sandbox with real products
Trajectory distillation with fine-tuning and RL
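The supervised part of trajectory distillation can be pictured as turning a large agent's interaction trajectories into training pairs for a smaller model. The sketch below is a minimal illustration under stated assumptions: the `Step` and `Trajectory` types, the prompt layout, and the choice to keep only successful trajectories are all hypothetical and are not taken from the paper. The reinforcement learning stage is not covered.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    observation: str  # what the sandbox returned (hypothetical format)
    action: str       # what the teacher agent did next (hypothetical format)


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step]
    success: bool  # whether the teacher solved the task


def build_sft_examples(trajectories: List[Trajectory]) -> List[Dict[str, str]]:
    """Convert successful teacher trajectories into (prompt, completion) pairs.

    Assumption: failed trajectories are discarded, and each step becomes one
    example whose prompt is the instruction plus the interaction history so far.
    """
    examples: List[Dict[str, str]] = []
    for traj in trajectories:
        if not traj.success:
            continue  # assumed filter: learn only from solved tasks
        history = [f"Instruction: {traj.instruction}"]
        for step in traj.steps:
            history.append(f"Observation: {step.observation}")
            examples.append(
                {"prompt": "\n".join(history), "completion": step.action}
            )
            history.append(f"Action: {step.action}")
    return examples
```

Each resulting pair asks the smaller model to reproduce the teacher's next action given everything seen so far, which is the usual shape of data for supervised fine-tuning of an agent.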
Authors

Jiangyuan Wang, Alibaba Group
Kejun Xiao, Alibaba Group
Qi Sun, Alibaba Group
Huaipeng Zhao, Alibaba Inc (natural language processing, Machine Learning)
Tao Luo, Alibaba Group
Jiandong Zhang, Alibaba Group
Xiaoyi Zeng, Alibaba Group