🤖 AI Summary
This work addresses the lack of standardized data collection, community definition, and evaluation protocols in community-conditioned language model adaptation, which makes studies hard to compare and artifacts hard to reuse. We present RedditPersona, a modular and reusable adaptation framework, applied to 112 Reddit subreddits in the urban well-being domain covering 301,429 user profiles and more than 16 million comments. The framework implements five user grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based) and trains one parameter-efficient adapter per strategy with QLoRA. A shared evaluation suite spanning fluency, fidelity, distributional alignment, and community identifiability reveals a consistent trade-off, across all five strategies, between community identifiability and distributional similarity to real text. In addition, each adapter's behavioral identifiability tracks how closely its grouping strategy agrees with the subreddit baseline.
📝 Abstract
Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates the adapters under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applying the framework to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.
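
The abstract does not say how a strategy's "intrinsic agreement with the subreddit baseline" is computed. As an illustration only, one common way to compare two partitions of the same users is normalized mutual information (NMI). The sketch below assumes each user carries exactly one label per partition; the function name, the arithmetic-mean normalization, and the toy labels are assumptions for this example and are not taken from the RedditPersona code.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions of the same users.

    Assumption: this is a generic agreement measure chosen for
    illustration, not necessarily the metric used in the paper.
    Normalization uses the arithmetic mean of the two entropies.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same users")
    n = len(labels_a)
    if n == 0:
        raise ValueError("at least one user is required")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        # Shannon entropy (natural log) of a partition's group sizes.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)

    # Mutual information summed over non-empty joint cells only.
    mi = sum(
        (c / n) * math.log((c * n) / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )

    denom = (h_a + h_b) / 2
    if denom == 0:
        # Both partitions put every user in a single group.
        return 1.0
    return mi / denom


# Hypothetical toy data: four users, a subreddit baseline, two groupings.
subreddit_baseline = ["r/a", "r/a", "r/b", "r/b"]
matching_grouping = [0, 0, 1, 1]   # same split, different label names
unrelated_grouping = [0, 1, 0, 1]  # cuts across the subreddit split

agreement_high = normalized_mutual_information(subreddit_baseline, matching_grouping)
agreement_low = normalized_mutual_information(subreddit_baseline, unrelated_grouping)
```

Under this measure, a grouping that reproduces the subreddit split scores close to 1 regardless of how its groups are named, while a grouping that is statistically independent of the subreddit split scores 0.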