Finetuning LLMs for Human Behavior Prediction in Social Science Experiments

📅 2025-09-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit weak generalization and notable demographic bias when predicting individual human behavior in social science experiments. Method: The authors propose supervised fine-tuning on real human response data, introducing SocSci210, a benchmark of 210 controlled experiments with fine-grained demographic annotations, and train Qwen-based models on it. Contribution/Results: The approach generalizes both across experiments and across experimental conditions. The strongest variant, Socrates-Qwen-14B, improves alignment with human response distributions by 26% over its base model (Qwen2.5-14B) and outperforms GPT-4o by 13%. It further attains a 71% gain in zero-shot prediction on unseen experimental conditions and reduces demographic bias by 10.6%. This work establishes a reproducible, scalable empirical framework for deploying LLMs in rigorous social science research.

📝 Abstract
Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations across diverse social science domains. Using an automatic pipeline, we construct SocSci210, a dataset comprising 2.9 million responses from 400,491 participants in 210 open-source social science experiments. Through finetuning, we achieve multiple levels of generalization. In completely unseen studies, our strongest model, Socrates-Qwen-14B, produces predictions that are 26% more aligned with distributions of human responses to diverse outcome questions under varying conditions relative to its base model (Qwen2.5-14B), outperforming GPT-4o by 13%. By finetuning on a subset of conditions in a study, generalization to new unseen conditions is particularly robust, improving by 71%. Since SocSci210 contains rich demographic information, we reduce demographic parity, a measure of bias, by 10.6% through finetuning. Because social sciences routinely generate rich, topic-specific datasets, our findings indicate that finetuning on such data could enable more accurate simulations for experimental hypothesis screening. We release our data, models, and finetuning code at stanfordhci.github.io/socrates.
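The abstract scores models by how well their predicted response distributions align with the empirical distributions of human responses. The paper's exact metric is not specified in this card; a minimal sketch, assuming alignment is measured as total variation distance between the model's predicted answer distribution and the observed human one (all data below is hypothetical):

```python
from collections import Counter

def empirical_distribution(responses):
    """Normalize a list of categorical responses into a probability distribution."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}

def total_variation_distance(p, q):
    """TV distance between two categorical distributions (0 = identical, 1 = disjoint)."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

# Hypothetical condition: five human responses vs. a model's predicted distribution.
human = empirical_distribution(["agree", "agree", "disagree", "neutral", "agree"])
model = {"agree": 0.5, "disagree": 0.3, "neutral": 0.2}
alignment_error = total_variation_distance(human, model)  # lower is better
```

A relative improvement like the reported 26% would then correspond to a reduction in this kind of divergence, averaged over outcome questions and conditions.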
Problem

Research questions and friction points this paper is trying to address.

Improving human behavior prediction accuracy in social science experiments
Enhancing generalization to unseen experimental conditions and studies
Reducing demographic bias in behavioral simulations through finetuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetuning LLMs on individual-level experimental response data
Achieving generalization to unseen studies and conditions
Reducing demographic bias through dataset-driven model improvement
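The core idea in the list above is turning individual-level experimental records into supervised examples. A minimal sketch of such a conversion, where the field names (`demographics`, `condition`, `question`, `response`) and the prompt template are illustrative assumptions, not taken from the paper:

```python
import json

def to_finetuning_example(record):
    """Turn one participant-level record into a prompt/target pair
    for supervised finetuning. Field names are illustrative."""
    prompt = (
        f"Participant profile: {record['demographics']}\n"
        f"Experimental condition: {record['condition']}\n"
        f"Question: {record['question']}\n"
        f"Answer:"
    )
    return {"prompt": prompt, "completion": " " + record["response"]}

# Hypothetical participant record.
record = {
    "demographics": "age 34, female, college-educated",
    "condition": "high-anonymity treatment",
    "question": "Would you share the information with your group?",
    "response": "yes",
}
example = to_finetuning_example(record)
print(json.dumps(example))
```

Training on many such pairs, one per participant, is what lets the model learn response distributions conditioned on demographics and condition rather than a single aggregate answer.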