What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

📅 2025-10-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This paper identifies a pitfall in LLM-era text classification for computational social science: *conceptualization bias*. When the target concept (e.g., “protest events”) lacks an explicit, operational definition, the resulting conceptualization errors bias downstream statistical estimates, and that bias cannot be corrected solely by improving model accuracy or by applying post-hoc bias correction. The authors use simulations to show how conceptualization error carries through to downstream estimates. The central argument is that conceptual rigor must precede model optimization. The paper advocates a *codebook-driven* practice in which precise, reproducible concept definitions, including boundary cases and decision rules, inform the whole analytical pipeline, and it offers concrete advice for obtaining low-cost, unbiased, low-variance downstream estimates.

📝 Abstract
Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting -- conceptualization of the concepts to be classified and the use of LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim that LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected solely by increasing LLM accuracy or by applying post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era, and we provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.
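
The simulation argument in the abstract can be illustrated with a toy example. The sketch below is not the paper's simulation framework; it is a minimal stand-in under stated assumptions. It assumes a hypothetical 30% prevalence of the intended concept, a hypothetical 15% of non-instances that a vague codebook wrongly counts as instances, gold labels annotated under that same vague codebook, and a simple difference-style correction as the post-hoc method. All names and parameter values are illustrative.

```python
import random


def simulate(llm_accuracy, n=20000, n_gold=2000, boundary_rate=0.15, seed=0):
    """Toy simulation: estimate concept prevalence from LLM labels.

    Assumption: the vague codebook counts some boundary cases as positives,
    and both the LLM and the gold annotators follow that vague codebook.
    """
    rng = random.Random(seed)
    intended = []  # labels under the concept the analyst actually means
    coded = []     # labels under the vague, underspecified codebook
    for _ in range(n):
        y = 1 if rng.random() < 0.30 else 0
        # Boundary case: a non-instance that the vague definition lets through.
        is_boundary = y == 0 and rng.random() < boundary_rate
        intended.append(y)
        coded.append(1 if is_boundary else y)

    # The LLM reproduces the vague codebook's label with the given accuracy.
    preds = [c if rng.random() < llm_accuracy else 1 - c for c in coded]

    # Naive estimate: mean of the LLM predictions.
    naive = sum(preds) / n

    # Post-hoc correction: add the mean (gold - prediction) gap measured on a
    # random gold-labeled subset. The gold labels use the vague codebook too.
    gold_idx = rng.sample(range(n), n_gold)
    rectifier = sum(coded[i] - preds[i] for i in gold_idx) / n_gold
    corrected = naive + rectifier

    truth = sum(intended) / n
    return {"truth": truth, "naive": naive, "corrected": corrected}


for acc in (0.8, 0.9, 1.0):
    r = simulate(acc)
    print(f"accuracy={acc:.1f}  truth={r['truth']:.3f}  "
          f"naive={r['naive']:.3f}  corrected={r['corrected']:.3f}")
```

In this setup the correction removes the LLM's classification error but not the gap between the vague codebook and the intended concept, so the corrected estimate stays above the true prevalence even when accuracy is 1.0. The size of the gap is a direct consequence of the assumed `boundary_rate`, not an empirical result.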
Problem

Research questions and friction points this paper is trying to address.

LLMs tempt analysts to skip conceptualization, causing biased estimates
Conceptualization-induced bias persists despite improved LLM accuracy
CSS must prioritize conceptualization for unbiased downstream inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focus on conceptualization before LLM prompting
Address conceptualization-induced bias in estimates
Provide advice for unbiased downstream inference
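
As a rough illustration of what a structured codebook with boundary cases and decision rules might look like in practice, the sketch below stores a concept definition as data and renders it into a classification prompt. The `Codebook` class, its fields, and the example protest definition are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Codebook:
    """Hypothetical container for an explicit, reusable concept definition."""
    concept: str
    definition: str
    decision_rules: list[str] = field(default_factory=list)
    # Maps a short description of a boundary case to its intended label.
    boundary_cases: dict[str, bool] = field(default_factory=dict)

    def to_prompt(self, document: str) -> str:
        """Render the codebook and one document into a single prompt string."""
        lines = [
            f"Concept: {self.concept}",
            f"Definition: {self.definition}",
            "Decision rules:",
        ]
        lines += [f"{i}. {rule}" for i, rule in enumerate(self.decision_rules, 1)]
        lines.append("Boundary cases:")
        lines += [
            f"- {case} -> {'YES' if label else 'NO'}"
            for case, label in self.boundary_cases.items()
        ]
        lines += [
            "",
            f"Document: {document}",
            f"Does the document describe a {self.concept}? Answer YES or NO.",
        ]
        return "\n".join(lines)


# Illustrative content only; a real codebook would come from the analyst.
codebook = Codebook(
    concept="protest event",
    definition="A public, collective gathering that makes a political claim.",
    decision_rules=[
        "The event must have already taken place.",
        "At least two participants must be physically present.",
    ],
    boundary_cases={
        "An online petition with no physical gathering": False,
        "A planned march that was announced but has not happened": False,
    },
)
print(codebook.to_prompt("Hundreds marched downtown on Friday over rent increases."))
```

Keeping the definition in one structure means the same text can be given to human annotators and to the LLM, so both label the same concept.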