CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-driven digital population approaches struggle to ensure behavioral authenticity and group diversity at the same time. This paper proposes a collaborative framework that integrates large language models (LLMs) with generative models, jointly optimizing heterogeneous, cross-domain behavioral signals (e.g., voting, rating) to model real-world population distributions with high fidelity. Key contributions include: (1) using a generative model to calibrate statistical biases in LLM outputs, thereby enhancing individual behavioral diversity and aggregate population representativeness; and (2) introducing a distribution-matching mechanism that enables scalable, task-agnostic simulation across domains, including social simulation, crowdsourcing, and recommender systems. Experiments show that the method closely approximates empirical population distributions across multiple domains and outperforms pure-LLM baselines on both accuracy and diversity metrics. The framework offers a way to construct low-cost, high-fidelity synthetic populations.

📝 Abstract
The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations for applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human-subject studies. However, research has found that most existing works rely solely on LLMs and cannot sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM, which integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct a theoretical analysis of CrowdLLM regarding its potential for creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments across multiple domains (e.g., crowdsourcing, voting, user rating), together with simulation studies, demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.
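To make "distributional fidelity to human data" concrete, the sketch below compares the empirical distribution of simulated ratings against that of human ratings using total variation distance. This is only an illustration: the abstract does not say which metric the paper uses, so the choice of total variation distance, the function names, and the rating values are all assumptions introduced here.

```python
from collections import Counter


def empirical_distribution(samples, support):
    """Return the relative frequency of each value in `support`."""
    if not samples:
        raise ValueError("samples must be non-empty")
    counts = Counter(samples)
    total = len(samples)
    return [counts[value] / total for value in support]


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means the distributions are identical; 1.0 means they do not overlap.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 1-5 star ratings; illustrative values only, not data from the paper.
support = [1, 2, 3, 4, 5]
human_ratings = [5, 4, 4, 3, 5, 2, 4, 5, 1, 3]
simulated_ratings = [4, 4, 4, 4, 5, 4, 4, 5, 4, 3]

p_human = empirical_distribution(human_ratings, support)
p_sim = empirical_distribution(simulated_ratings, support)
tv = total_variation(p_human, p_sim)
print(f"Total variation distance: {tv:.2f}")
```

In this toy case the simulated ratings pile up on 4 stars, which is the kind of low-diversity output the abstract attributes to LLM-only populations; a lower distance would indicate a closer match to the human distribution.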
Problem

Research questions and friction points this paper is trying to address.

LLM-only digital populations lack the diversity and fidelity of real human populations
Relying solely on LLMs does not sufficiently capture the accuracy of human data
Recruiting human participants is costly and raises human-subject concerns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates pretrained LLMs with generative models
Enhances diversity and fidelity of digital populations
Achieves accuracy and distributional fidelity to human data
Ryan Feng Lin
Department of Industrial and Systems Engineering, University of Washington, Seattle, WA 98195, USA
Keyu Tian
Department of Data Science, City University of Hong Kong, Kowloon, Hong Kong
Hanming Zheng
Department of Data Science, City University of Hong Kong, Kowloon, Hong Kong
Congjing Zhang
Department of Industrial and Systems Engineering, University of Washington, Seattle, WA 98195, USA
Li Zeng
Peking University
LLM training and inference, Vector Computing, Graph Computing
Shuai Huang
Department of Industrial and Systems Engineering, University of Washington, Seattle, WA 98195, USA