🤖 AI Summary
This work addresses the lack of evaluation benchmarks grounded in real user behavior for personalized decision modeling. Existing benchmarks often rely on simulated users or model-generated behavior, which can diverge systematically from human decision-making. To bridge this gap, the authors introduce BehaviorBench, which uses public prediction-market and on-chain records to reconstruct the decision histories of 2,000 real wallets. The benchmark defines two core tasks: Belief prediction (a user's final revealed stance and confidence in a market) and Trade prediction (the direction and amount of individual transactions). Models are evaluated under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Across 141,445 Belief instances and 1,485,972 Trade instances, personalization improves Belief prediction more consistently than Trade prediction. Model rankings also change across tasks and evaluation metrics, and different history interfaces expose different failure modes.
📝 Abstract
Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce \textsc{BehaviorBench}, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. \textsc{BehaviorBench} reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: \emph{Belief prediction}, which predicts a user's final revealed stance and confidence in a market, and \emph{Trade prediction}, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. \textsc{BehaviorBench} provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.
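To make the four history interfaces concrete, the following is a minimal illustrative sketch of how a model's input context might be assembled under each one. It is not the benchmark's actual code: the names `HistoryInterface`, `Trade`, `format_trades`, and `build_context`, the record fields, the direction labels, and the cutoff `k` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class HistoryInterface(Enum):
    """The four history interfaces named in the abstract."""
    NONE = "no personalization"
    RECENT = "direct recent history"
    PROFILE = "generated user profile"
    RETRIEVED = "retrieved support-wallet evidence"


@dataclass(frozen=True)
class Trade:
    """Hypothetical record of one observed transaction (fields are assumptions)."""
    market: str
    direction: str  # hypothetical label, e.g. "buy_yes"
    amount: float


def format_trades(trades: Sequence[Trade]) -> str:
    """Render trades as one bullet line each."""
    return "\n".join(f"- {t.market}: {t.direction} {t.amount:g}" for t in trades)


def build_context(
    interface: HistoryInterface,
    question: str,
    history: Sequence[Trade] = (),
    profile: str = "",
    support: Sequence[Trade] = (),
    k: int = 3,
) -> str:
    """Assemble the text shown to a model under one history interface."""
    if interface is HistoryInterface.NONE:
        # No personalization: the model sees only the question.
        return question
    if interface is HistoryInterface.RECENT:
        # Direct recent history: the wallet's own last k trades.
        evidence = format_trades(history[max(0, len(history) - k):])
    elif interface is HistoryInterface.PROFILE:
        # Generated profile: a text summary assumed to be produced upstream.
        evidence = profile
    else:
        # Retrieved evidence: the first k trades from support wallets,
        # assumed to be already ranked by some retriever.
        evidence = format_trades(support[:k])
    return f"{evidence}\n\n{question}"
```

Keeping the question fixed and varying only the evidence block reflects the comparison the abstract describes: the same Belief or Trade instance is scored under each interface, so differences in accuracy can be attributed to how the history is presented.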