🤖 AI Summary
Problem: Existing agent evaluation frameworks overlook experience-driven adaptive learning and reasoning in dynamic environments, particularly in multi-turn natural language dialogue for product recommendation.
Method: We introduce BELA, the first benchmark for context-aware experiential learning, integrating real-world Amazon product data, structured user profiles, and a large language model–driven user simulator to systematically assess agents’ active exploration, continual learning, and adaptive decision-making.
Contribution/Results: Experiments reveal that state-of-the-art large language models fail to improve performance across interaction episodes, exposing fundamental limitations in contextual accumulation, preference evolution modeling, and policy iteration. BELA establishes a reproducible, scalable paradigm for evaluating and advancing long-term agent adaptability in interactive, evolving settings.
📝 Abstract
To reliably navigate ever-shifting real-world environments, agents must grapple with incomplete knowledge and adapt their behavior through experience. However, current evaluations largely focus on tasks that leave no ambiguity and do not measure agents' ability to adaptively learn and reason through the experiences they accrue. We illustrate the need for this in-context experiential learning in a product recommendation setting, where agents must navigate shifting customer preferences and product landscapes through natural language dialogue. We curate a benchmark for experiential learning and active exploration (BELA) that combines (1) rich real-world products from Amazon, (2) a diverse collection of user personas representing heterogeneous yet latent preferences, and (3) an LLM user simulator conditioned on each persona to generate rich interactive trajectories. We observe that current frontier models struggle to meaningfully improve across episodes, underscoring the need for agentic systems with strong in-context learning capabilities.
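To make the evaluation loop concrete, below is a minimal sketch of how a persona-conditioned user simulator might interact with a recommendation agent across episodes. The class names, the `llm` callable stub, the catalog entries, and the success criterion are all illustrative assumptions for exposition, not BELA's actual API or scoring protocol.

```python
import random
from dataclasses import dataclass


@dataclass
class Persona:
    """Latent user profile the agent must uncover through dialogue (assumed schema)."""
    description: str        # e.g. "budget-conscious hiker who prefers natural fabrics"
    target_product_id: str  # hidden ground-truth item the user would accept


class UserSimulator:
    """LLM-driven user: replies in character without revealing the persona outright."""

    def __init__(self, persona: Persona, llm):
        self.persona = persona
        self.llm = llm  # any callable prompt -> str (assumption, not BELA's interface)

    def respond(self, agent_message: str, history: list[str]) -> str:
        prompt = (
            f"You are a shopper with this hidden profile: {self.persona.description}\n"
            + "\n".join(history)
            + f"\nAgent: {agent_message}\nReply in character; do not state the profile."
        )
        return self.llm(prompt)


def run_episode(agent, sim: UserSimulator, catalog: dict[str, str],
                max_turns: int = 8) -> bool:
    """One dialogue episode; success = the agent recommends the target item."""
    history: list[str] = []
    for _ in range(max_turns):
        message, recommendation = agent.act(history, catalog)
        history.append(f"Agent: {message}")
        if recommendation == sim.persona.target_product_id:
            return True
        history.append(f"User: {sim.respond(message, history)}")
    return False


class RandomAgent:
    """Trivial baseline: asks a generic question and guesses a random item."""

    def act(self, history: list[str], catalog: dict[str, str]):
        return "What kind of product are you looking for?", random.choice(list(catalog))


# Experiential learning is evaluated *across* episodes: a capable agent carries its
# accumulated context forward and should succeed in fewer turns over time, whereas
# this memoryless baseline shows no such improvement.
catalog = {"B001": "trail shoes", "B002": "wool socks", "B003": "rain jacket"}
sim = UserSimulator(
    Persona("budget hiker who loves wool", "B002"),
    llm=lambda p: "I mostly hike on weekends and prefer natural fabrics.",
)
successes = [run_episode(RandomAgent(), sim, catalog) for _ in range(5)]
print(f"Success rate: {sum(successes)}/5")
```

The key design point this sketch illustrates is that the user's preferences live only in the persona given to the simulator, so the agent can surface them solely through dialogue, and improvement must come from context accumulated over repeated episodes rather than from any weight updates.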