Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

📅 2026-06-03

🤖 AI Summary
This study investigates whether high-fidelity, individual-level digital twins can be built from the kind of pre-existing, heterogeneous panel data that firms accumulate through CRM systems, loyalty programs, and repeat surveys, with the aim of scaling market research. Using the German Socio-Economic Panel (SOEP) as the data source, the authors evaluate three open-weights large language models (LLMs) across five cumulative information depths, two embedding methods (narrative persona summaries versus raw dialog histories of past responses), and two reasoning modes (explicit thinking versus default), scoring over 2.1 million twin responses from 500 participants on 183 held-out questions. Twin quality rises with information depth, with diminishing returns past the 75 percent entropy level. At full depth, raw dialog history yields higher hold-out accuracy than narrative summaries in every model-by-reasoning cell, and explicit thinking raises rank-order correlation without changing accuracy. The best configuration reaches 78.8% accuracy and a Fisher-z correlation of r = 0.590.
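For readers unfamiliar with the Fisher-z correlation mentioned above, the following is a minimal sketch of the standard procedure for averaging correlations: transform each r with atanh, average in z-space, and transform back with tanh. The listing does not say how the paper aggregates its correlations, so the function name `fisher_z_mean` and the unweighted mean are assumptions made for illustration only.

```python
import math


def fisher_z_mean(correlations):
    """Average correlation coefficients via the Fisher z-transform.

    Assumption: an unweighted mean in z-space; the paper's exact
    aggregation (e.g. weighting by sample size) is not stated here.
    Each r must lie strictly between -1 and 1.
    """
    zs = [math.atanh(r) for r in correlations]  # r -> z
    mean_z = sum(zs) / len(zs)                  # average in z-space
    return math.tanh(mean_z)                    # z -> r


# Hypothetical per-group correlations, not values from the paper.
pooled = fisher_z_mean([0.3, 0.8])
```

Averaging in z-space reduces the bias that arises from averaging raw correlations directly, because the sampling distribution of r is skewed when |r| is large.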
📝 Abstract
LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
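As an illustration of ranking items by normalized Shannon entropy and keeping a cumulative share of them, here is a minimal sketch. The abstract does not give the exact normalization, the ranking direction, or how the depth cut-offs are applied, so the choices below (normalizing by the log of the number of observed answer categories, ranking highest entropy first, and the helper names `normalized_entropy` and `select_items`) are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def normalized_entropy(answers):
    """Shannon entropy of an answer distribution, scaled to [0, 1].

    Assumption: normalize by log(k), where k is the number of
    distinct observed answers. Items with a single answer get 0.
    """
    counts = Counter(answers)
    k = len(counts)
    if k < 2:
        return 0.0
    n = len(answers)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


def select_items(item_answers, fraction):
    """Return the top `fraction` of items ranked by normalized entropy.

    Assumption: higher-entropy items are added first; `fraction`
    is a cumulative depth such as 0.75 or 1.0.
    """
    ranked = sorted(
        item_answers,
        key=lambda name: normalized_entropy(item_answers[name]),
        reverse=True,
    )
    keep = math.ceil(fraction * len(ranked))
    return ranked[:keep]


# Hypothetical toy panel: item name -> answers across respondents.
toy_items = {
    "q1": ["a", "b", "a", "b"],
    "q2": ["a", "a", "a", "b"],
    "q3": ["a", "a", "a", "a"],
    "q4": ["a", "b", "c", "a"],
}
depth_75 = select_items(toy_items, 0.75)
depth_100 = select_items(toy_items, 1.0)
```

In this toy data, the constant item `q3` carries no information about how respondents differ, so it is the last one added as depth increases.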
Problem

Research questions and friction points this paper is trying to address.

digital twins
LLM-based personas
socio-economic microdata
market research
individual-level simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based digital twins
socio-economic microdata
information depth
embedding methods
reasoning modes