$\texttt{BluePrint}$: A Social Media User Dataset for LLM Persona Evaluation and Training

📅 2025-09-27
🤖 AI Summary
Existing research lacks standardized datasets for training and evaluating large language models (LLMs) as social media agents. To address this gap, the authors propose SIMPACT, a framework for building such datasets, and release BluePrint, a public, LLM-oriented social media user dataset for persona modeling. BluePrint is constructed from publicly accessible Bluesky political discourse data; it uses behavioral clustering and pseudonymization to group users into personas of aggregated behavior while protecting privacy. The work formalizes agent behavior as a "next-action prediction" task over 12 social media action types and evaluates behavioral fidelity at two levels, the cluster level and the population level. BluePrint supports context-sensitive modeling of social behavior and serves as a benchmark for political discourse modeling and as a template for domain-specific datasets on problems such as misinformation and polarization.

📝 Abstract
Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), with each instance tied to the posting activity preceding it. This supports the development of agents that model social media users with context dependence not only in language but also in interaction behaviours. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.
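To make the two evaluation levels concrete, the following is a minimal sketch of how predicted actions could be compared with observed actions per cluster and across the whole population. It is not the paper's metric: the Jensen-Shannon divergence, the three-label `ACTIONS` subset, and the names `action_distribution`, `js_divergence`, and `fidelity_report` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical subset of action labels; BluePrint defines 12 action types.
ACTIONS = ["like", "reply", "repost"]

def action_distribution(actions):
    """Return the normalized frequency of each label in ACTIONS."""
    counts = Counter(actions)
    total = sum(counts[a] for a in ACTIONS)
    if total == 0:
        # No observations: fall back to a uniform distribution.
        return [1.0 / len(ACTIONS)] * len(ACTIONS)
    return [counts[a] / total for a in ACTIONS]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = maximum)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fidelity_report(records):
    """records: iterable of (cluster_id, true_action, predicted_action).

    Returns per-cluster divergences and one population-level divergence.
    """
    by_cluster = defaultdict(lambda: ([], []))
    all_true, all_pred = [], []
    for cluster_id, true_action, predicted_action in records:
        by_cluster[cluster_id][0].append(true_action)
        by_cluster[cluster_id][1].append(predicted_action)
        all_true.append(true_action)
        all_pred.append(predicted_action)
    cluster_scores = {
        cid: js_divergence(action_distribution(t), action_distribution(p))
        for cid, (t, p) in by_cluster.items()
    }
    population_score = js_divergence(
        action_distribution(all_true), action_distribution(all_pred)
    )
    return cluster_scores, population_score
```

Under these assumptions, a model that swaps the behaviour of two personas (one that only likes, one that only replies) would score perfectly at the population level while scoring the maximum divergence in each cluster, which illustrates why reporting both levels is useful.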
Problem

Research questions and friction points this paper is trying to address.

The field lacks standardized data for fine-tuning LLMs as social media agents
Realistic user behavior simulation needs privacy-respecting datasets
Social simulations require evaluation metrics for behavioral fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

The SIMPACT framework constructs privacy-respecting social media datasets
Next-action prediction is formulated as the task for training and evaluating agents
The BluePrint dataset clusters anonymized users into behavioral personas
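The contributions above can be illustrated with a small sketch of two steps: replacing handles with pseudonyms and turning a chronological event log into next-action instances. This is an assumption-laden illustration, not SIMPACT's actual pipeline: the event fields (`handle`, `timestamp`, `action`, `text`), the keyed-hash pseudonymization, the `context_size` parameter, and the function names are hypothetical, and persona clustering and PII removal from post text are not shown.

```python
import hashlib
import hmac
from collections import defaultdict

def pseudonymize(handle: str, secret_key: bytes) -> str:
    """Map a handle to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, handle.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]

def build_instances(events, secret_key, context_size=2):
    """Build next-action instances from raw events.

    events: list of dicts with hypothetical keys
            'handle', 'timestamp', 'action', 'text'.
    Each instance pairs the preceding activity (context) with the
    action that followed it (target).
    """
    per_user = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        per_user[pseudonymize(event["handle"], secret_key)].append(event)

    instances = []
    for user_id, history in per_user.items():
        # The first event of a user has no preceding context, so skip it.
        for i in range(1, len(history)):
            context = history[max(0, i - context_size):i]
            instances.append({
                "user": user_id,
                "context": [(e["action"], e["text"]) for e in context],
                "target_action": history[i]["action"],
            })
    return instances
```

A keyed hash is used here so that pseudonyms stay consistent across the dataset without being reversible by anyone who lacks the key; this is one common choice, and the original work may use a different scheme.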