Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the threat that large language models (LLMs) pose to the authenticity and quality of crowdsourced free-text data, since crowdworkers may outsource their tasks to models. It investigates the research community's awareness of LLM usage in crowdsourced datasets, current mitigation strategies, and associated challenges, based on a survey of 155 researchers in NLP and related disciplines, complemented by qualitative and quantitative analyses. Findings show that 44% of respondents have observed LLM usage in their crowdsourced data. Of those, 93% had anticipated the issue, yet half were unsure what precautions to take. Building on these findings, the paper derives a set of considerations for collecting crowdsourced free-text data in the LLM era.

📝 Abstract

The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications for data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies rely on distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.

Problem

Research questions and friction points this paper is trying to address.

crowdsourcing

Large Language Models

data validity

human data collection

LLM misuse

Innovation

Methods, ideas, or system contributions that make the work stand out.

crowdsourcing

Large Language Models

data quality