🤖 AI Summary
This paper identifies a pitfall in LLM-era text classification for computational social science: *conceptualization bias*. When target concepts (e.g., "protest events") lack explicit, operational definitions, the resulting conceptualization errors bias downstream statistical inference, and that bias cannot be removed solely by improving model accuracy or applying post-hoc bias correction. We study text classification with generative large language models and use simulations to quantify how conceptualization error propagates into the bias and variance of downstream estimates. Our core contribution is to demonstrate that conceptual rigor must precede model optimization. We propose a *codebook-driven paradigm* for low-bias practice, in which precise, reproducible concept definitions, including boundary cases and decision rules, are integrated throughout the analytical pipeline. The paper closes with concrete advice on pursuing low-cost, unbiased, low-variance downstream estimates.
📝 Abstract
Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting -- conceptualization of the concepts to be classified and the use of LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim that LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected solely by increasing LLM accuracy or by applying post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.
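The simulation claim can be illustrated with a toy example. The sketch below is not the paper's simulation framework; it is a minimal illustration under stated assumptions. All names and numbers are hypothetical: `simulate`, the 20% share of "core" documents, the 10% share of "borderline" documents that only a loose, unstated definition would count, and a simple difference estimator standing in for a post-hoc bias correction. The key assumption is that both the LLM and the gold-standard annotations follow the loose definition, so the correction recovers the loose concept rather than the intended one.

```python
import random


def simulate(llm_accuracy, n_docs=20000, n_gold=2000,
             p_core=0.20, p_borderline=0.10, seed=0):
    """Toy prevalence estimation under a conceptualization error (hypothetical setup)."""
    rng = random.Random(seed)

    intended = []  # labels under the intended, well-defined concept (core only)
    loose = []     # labels under the loose, unstated concept (core + borderline)
    for _ in range(n_docs):
        u = rng.random()
        is_core = u < p_core
        is_borderline = p_core <= u < p_core + p_borderline
        intended.append(int(is_core))
        loose.append(int(is_core or is_borderline))

    # The LLM reproduces the loose concept, flipping each label with
    # probability 1 - llm_accuracy.
    preds = [y if rng.random() < llm_accuracy else 1 - y for y in loose]
    naive = sum(preds) / n_docs

    # Stand-in post-hoc correction: a difference estimator on a gold sample
    # that was annotated under the same loose concept (an assumption).
    gold = rng.sample(range(n_docs), n_gold)
    corrected = naive + sum(loose[i] - preds[i] for i in gold) / n_gold

    return {
        "truth": sum(intended) / n_docs,
        "naive": naive,
        "corrected": corrected,
    }


if __name__ == "__main__":
    for acc in (0.8, 0.9, 1.0):
        r = simulate(acc)
        print(f"accuracy={acc:.1f}  truth={r['truth']:.3f}  "
              f"naive={r['naive']:.3f}  corrected={r['corrected']:.3f}")
```

In this toy setup the corrected estimate moves toward the prevalence of the loose concept as the correction removes classifier error, yet it stays roughly `p_borderline` above the intended prevalence even at perfect accuracy. Closing that gap requires fixing the definition itself, not the classifier.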