Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

📅 2026-06-11

🤖 AI Summary

This work addresses the lack of a controllable, reproducible evaluation framework for conversational shopping assistants. It proposes a modular two-agent simulation in which a fixed buyer agent, configured with a persona, a mission, and a patience level, interacts with interchangeable responder modules that call a real e-commerce search API, so that different responder architectures can be compared on identical scenarios. Across 2011 conversations in 14 persona buckets, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. A systematic failure analysis of one responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Swapping the responder's LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16 to 0.45 points despite an identical architecture. The study also documents a systematic disagreement between LLM judges given the same evaluation prompt: Gemini rewards process correctness, while Claude demands concrete outcomes.

📝 Abstract

We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16 to 0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.
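The setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Buyer`, `RollingWindowResponder`, `simulate`, `fake_search`) is hypothetical, the buyer and responder are rule-based stand-ins for LLM calls, and `fake_search` is a toy substitute for the real e-commerce search API. The sketch only shows the overall shape: a fixed buyer with a patience budget, a swappable responder, and a rolling-window memory that keeps the last few turns.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass
class Buyer:
    """Fixed buyer agent: persona, mission, and a patience budget in turns."""
    persona: str
    mission: str
    patience: int

    def next_message(self, turn: int, last_reply: str) -> str:
        # Hypothetical stand-in for an LLM call: restate the mission each turn.
        return f"[{self.persona}] turn {turn}: still looking for {self.mission}"

    def satisfied(self, reply: str) -> bool:
        # Toy success check: the reply mentions the mission item verbatim.
        return self.mission in reply


class RollingWindowResponder:
    """Responder whose memory holds only the last `window` turns."""

    def __init__(self, search: Callable[[str], List[str]], window: int = 3):
        self.search = search
        # deque(maxlen=...) silently drops the oldest turn once it is full.
        self.memory: Deque[Tuple[str, str]] = deque(maxlen=window)

    def reply(self, message: str) -> str:
        # Build the search query from remembered buyer messages plus the new one.
        context = " ".join(past_message for past_message, _ in self.memory)
        results = self.search(f"{context} {message}".strip())
        answer = "; ".join(results) if results else "no results"
        self.memory.append((message, answer))
        return answer


def fake_search(query: str) -> List[str]:
    """Toy search: return catalog items whose words all appear in the query."""
    catalog = ["red running shoes", "blue rain jacket", "trail running socks"]
    words = set(query.lower().split())
    return [item for item in catalog if all(w in words for w in item.split())]


def simulate(buyer: Buyer, responder: RollingWindowResponder) -> dict:
    """Run one conversation until the buyer is satisfied or patience runs out."""
    transcript = []
    reply = ""
    for turn in range(1, buyer.patience + 1):
        message = buyer.next_message(turn, reply)
        reply = responder.reply(message)
        transcript.append((message, reply))
        if buyer.satisfied(reply):
            return {"success": True, "turns": turn, "transcript": transcript}
    return {"success": False, "turns": buyer.patience, "transcript": transcript}


# The same buyer can be replayed against different responder designs.
buyer = Buyer(persona="bargain hunter", mission="red running shoes", patience=4)
outcome = simulate(buyer, RollingWindowResponder(fake_search, window=2))
```

Because the buyer is held constant, comparing two responder designs in this sketch amounts to calling `simulate` with the same `Buyer` and a different responder object, then comparing the returned outcomes.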

Problem

Research questions and friction points this paper is trying to address.

agentic search

e-commerce

conversational shopping assistant

simulation framework

LLM evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

two-agent simulation

agentic search

rolling-window memory