SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

📅 2026-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing e-commerce agents struggle to capture the heterogeneity of real-world buyers: they often collapse into a single “average buyer” strategy and rely on hand-crafted persona prompts that are brittle and context-inefficient. This work proposes a method that learns interpretable, discrete buyer types directly from raw clickstream data using a behavior-aware vector-quantized variational autoencoder (VQ-VAE), and maps each learned type to a compact persona token in the vocabulary of a large language model. The agent is then fine-tuned with these tokens on real browsing traces. At inference, a buyer is assigned to a type with a single encoder forward pass and no retraining, and sampling types from each merchant's empirical distribution over the codebook preserves merchant-specific buyer populations. Evaluated on 8.37 million buyers across 42 held-out live storefronts, the method achieves 78% conversion-rate alignment with real buyers and outperforms a baseline with eight times more parameters on goal-oriented shopping tasks. The authors also release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.
📝 Abstract
LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on 8.37M buyers across 42 held-out live storefronts, SimPersona achieves 78% conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with 8× more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.
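The assignment and population-sampling steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes buyer embeddings have already been produced by some encoder, uses a toy two-dimensional codebook, and invents the `<persona_k>` token naming and all function names purely for illustration. It shows only the generic VQ-VAE idea of nearest-codebook assignment, followed by sampling buyer types from a merchant's empirical distribution over codes.

```python
import random
from collections import Counter


def nearest_code(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(embedding, codebook[k]))


def persona_token(code_index):
    # Hypothetical token format; the paper's actual token names are not given.
    return f"<persona_{code_index}>"


def merchant_distribution(buyer_embeddings, codebook):
    """Empirical distribution of one merchant's buyers over codebook indices."""
    counts = Counter(nearest_code(e, codebook) for e in buyer_embeddings)
    total = sum(counts.values())
    return {k: counts.get(k, 0) / total for k in range(len(codebook))}


def sample_personas(distribution, n, seed=0):
    """Draw n persona tokens in proportion to the merchant's empirical distribution."""
    rng = random.Random(seed)
    codes = list(distribution)
    weights = [distribution[k] for k in codes]
    return [persona_token(k) for k in rng.choices(codes, weights=weights, k=n)]


# Toy data (assumed): three buyer types and four buyers from one merchant.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 0.0]]
buyers = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 0.3]]

dist = merchant_distribution(buyers, codebook)  # {0: 0.25, 1: 0.5, 2: 0.25}
agents = sample_personas(dist, n=5)             # five persona tokens for simulated agents
```

In this sketch, assigning a new buyer costs one nearest-neighbor lookup over the codebook, which mirrors the abstract's claim that assignment needs no retraining. Each sampled token would then be prepended to the agent's input in whatever way the fine-tuned model expects.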
Problem

Research questions and friction points this paper is trying to address.

buyer personas
clickstream data
LLM-based agents
behavior heterogeneity
e-commerce personalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

SimPersona
discrete buyer personas
VQ-VAE
LLM-based web agents
clickstream learning