Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina

📅 2024-10-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably serve as behavioral decision agents, focusing on the canonical 11–20 money request game—a well-established paradigm in experimental economics for studying strategic reasoning and social preferences. Method: We conduct a systematic, multi-model comparative evaluation (GPT-4, Claude, Llama, etc.) under standardized game protocols, employing two-sample Kolmogorov–Smirnov tests and ablation analyses to quantify distributional divergence from human behavior. Contribution/Results: We provide the first rigorous evidence that all mainstream LLMs significantly deviate from human decision distributions (p < 0.001), exhibiting three distinct sources of unpredictability: linguistic dependency, role sensitivity, and interference from safety-alignment mechanisms. These findings expose fundamental limitations of LLMs in behavioral modeling and challenge the implicit assumption that they constitute valid human proxies. Furthermore, we introduce a reproducible, protocol-driven evaluation framework—establishing both methodological foundations and critical behavioral benchmarks for future human–AI behavioral alignment research.
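The summary's core statistical step—comparing an LLM's distribution of 11–20 game requests against a human baseline with a two-sample Kolmogorov–Smirnov test—can be sketched as follows. This is a minimal illustrative implementation, not the paper's code; the sample request vectors below are invented for demonstration, not the study's data.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute gap
    between the empirical CDFs of samples a and b.

    In the 11-20 money request game, each entry is a player's integer
    request between 11 and 20 (a player earns their request, plus a
    bonus of 20 if it is exactly one below the opponent's request).
    """
    values = sorted(set(a) | set(b))
    na, nb = len(a), len(b)
    d = 0.0
    for v in values:
        cdf_a = sum(x <= v for x in a) / na  # empirical CDF of a at v
        cdf_b = sum(x <= v for x in b) / nb  # empirical CDF of b at v
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical samples: humans spread requests across 17-20, while an
# LLM collapses onto the safe choice of 20 (illustrative numbers only).
human_requests = [17, 18, 19, 19, 20]
llm_requests = [20, 20, 20, 20, 20]

d_stat = ks_statistic(human_requests, llm_requests)
print(f"KS statistic D = {d_stat:.2f}")  # larger D = greater divergence
```

A large D here would be converted to a p-value (e.g. via `scipy.stats.ks_2samp`) to test whether the LLM's request distribution differs significantly from the human one, as the summary reports.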

📝 Abstract
Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Nearly all advanced approaches fail to replicate human behavior distributions across many models. Causes of failure are diverse and unpredictable, relating to input language, roles, and safeguarding. These results advise caution when using LLMs to study human behavior or as surrogates or simulations.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Human Behavior Simulation
Reliability in Human Studies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Human Decision-making Simulation
AI in Behavioral Research
Yuan Gao
Questrom School of Business, Information Systems Department, Boston University
Dokyun Lee
Questrom School of Business, Information Systems Department and Computing & Data Sciences, Boston University
Gordon Burtch
Allen and Kelli Questrom Professor in Information Systems, Boston University // @gburtch
User Generated Content · Digital Platforms · Crowds · Tech and Society · Causal Inference
Sina Fazelpour
Northeastern University
Philosophy of Science · Ethics of Artificial Intelligence · Cognitive Modeling · Social Epistemology