When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

📅 2026-06-04

🤖 AI Summary

Large language models (LLMs) can achieve high accuracy in coding political events yet still fail to follow expert-defined coding rules faithfully. This study first operationalizes expert codebooks into LLM-friendly forms with clearer definitions, illustrative examples, retrieved context, and rules for difficult cases, then evaluates the models' behavioral reliability under controlled perturbations of label names, codebook order, and label-definition mappings. The results show a disconnect between predictive performance and behavioral reliability: high accuracy does not necessarily imply compliance with social-science coding logic. Although the refined codebooks substantially improve classification performance, especially for fine-grained event classification, models can still fail reliability tests under controlled codebook changes, which underscores the need for explicit reliability assessment in computational social science applications.

📝 Abstract

High accuracy does not necessarily make an LLM a faithful coder. This issue matters because many social-science studies rely on expert-written codebooks to turn text into structured data. We study this problem in political event coding, a challenging source-target relation classification task beyond ordinary sentence-level classification, where models must determine what one actor did to another using detailed coding rules. We test whether expert codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. We then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification. However, these predictive gains do not fully translate into behavioral reliability. Models may produce valid labels and recover definitions while still failing behavioral reliability tests under controlled codebook changes. These findings suggest that codebook-guided LLM systems should be evaluated not only by accuracy, but also by whether they preserve the coding logic that makes coded outputs meaningful for social-science research.
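The three controlled changes named in the abstract (label names, codebook order, label-definition mappings) can be illustrated with a minimal sketch. This is not the paper's evaluation protocol: the codebook, the texts, the two toy coders, and every function name below are hypothetical, and a real study would replace the toy coders with an actual LLM call. The idea shown is that a coder who follows the codebook should change its output in a predictable way when the codebook changes, so reliability can be scored as agreement with that expected output.

```python
import random
from typing import Callable, Dict, List

Codebook = Dict[str, str]               # label name -> definition (insertion-ordered)
Coder = Callable[[str, Codebook], str]  # hypothetical stand-in for an LLM call


def rename_labels(codebook: Codebook, new_names: Dict[str, str]) -> Codebook:
    """Replace label names while keeping each definition and the original order."""
    return {new_names.get(label, label): text for label, text in codebook.items()}


def shuffle_order(codebook: Codebook, seed: int = 0) -> Codebook:
    """Present the same label-definition pairs in a different order."""
    items = list(codebook.items())
    random.Random(seed).shuffle(items)
    return dict(items)


def swap_definitions(codebook: Codebook, label_a: str, label_b: str) -> Codebook:
    """Exchange the definitions attached to two labels."""
    swapped = dict(codebook)
    swapped[label_a], swapped[label_b] = codebook[label_b], codebook[label_a]
    return swapped


def consistency(coder: Coder, texts: List[str], original: Codebook,
                perturbed: Codebook, expected: Dict[str, str]) -> float:
    """Share of texts where the perturbed-codebook label equals the label a
    rule-following coder would give (original label mapped through `expected`)."""
    hits = 0
    for text in texts:
        before = coder(text, original)
        after = coder(text, perturbed)
        hits += after == expected.get(before, before)
    return hits / len(texts)


def keyword_coder(text: str, codebook: Codebook) -> str:
    """Toy coder that reads definitions: picks the label with most shared words."""
    words = set(text.lower().split())
    return max(codebook, key=lambda lab: len(words & set(codebook[lab].lower().split())))


def name_biased_coder(text: str, codebook: Codebook) -> str:
    """Toy coder that ignores the codebook and relies on prior label meanings."""
    return "PROTEST" if "march" in text.lower() else "SANCTION"


# Hypothetical two-label codebook and texts, for illustration only.
CODEBOOK: Codebook = {
    "PROTEST": "citizens demonstrate or march against a government",
    "SANCTION": "a state imposes economic penalties on another state",
}
TEXTS = [
    "students march against the government",
    "the state imposes economic penalties on its neighbor",
]
SWAP = {"PROTEST": "SANCTION", "SANCTION": "PROTEST"}

swapped = swap_definitions(CODEBOOK, "PROTEST", "SANCTION")
print("definition-following:", consistency(keyword_coder, TEXTS, CODEBOOK, swapped, SWAP))
print("name-biased:", consistency(name_biased_coder, TEXTS, CODEBOOK, swapped, SWAP))
```

In this toy setup both coders assign the same labels under the original codebook, and both always return valid labels, yet only the definition-following coder tracks the swapped mapping. That mirrors the abstract's point that accuracy on the original codebook alone cannot distinguish the two behaviors.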

Problem

Research questions and friction points this paper is trying to address.

LLM political event coding

behavioral reliability

codebook-guided coding

predictive performance

structured data generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

behavioral reliability

codebook optimization

LLM political event coding