Codebook LLMs: Evaluating LLMs as Measurement Tools for Political Science Concepts

📅 2024-07-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) often fail to operationalize political science concepts (such as protest events, political violence, and party manifestos) according to human-constructed codebooks, which limits their usefulness in social science research. Method: We propose a five-stage LLM measurement framework tailored to political science; curate and publicly release three real-world, human-annotated codebook datasets; and evaluate open-weight LLMs (7–12B parameters) using zero-shot prompting, error analysis, and lightweight LoRA fine-tuning. Contribution/Results: Current open-weight LLMs show limited zero-shot performance on codebook-based annotation tasks, but supervised instruction tuning with parameter-efficient training substantially improves accuracy. Rather than naming a best model, this work offers codebook datasets, a reproducible evaluation framework, and practical guidance for social scientists who want to run their own codebook-LLM measurement projects.

📝 Abstract

Codebooks -- documents that operationalize concepts and outline annotation procedures -- are used almost universally by social scientists when coding political texts. To code these texts automatically, researchers are increasingly turning to generative large language models (LLMs). However, there is limited empirical evidence on whether "off-the-shelf" LLMs faithfully follow real-world codebook operationalizations and measure complex political constructs with sufficient accuracy. To address this, we gather and curate three real-world political science codebooks -- covering protest events, political violence, and manifestos -- along with their unstructured texts and human labels. We also propose a five-stage framework for codebook-LLM measurement: preparing a codebook for both humans and LLMs, testing LLMs' basic capabilities on a codebook, evaluating zero-shot measurement accuracy (i.e., off-the-shelf performance), analyzing errors, and further (parameter-efficient) supervised training of LLMs. We provide an empirical demonstration of this framework using our three codebook datasets and several pretrained open-weight LLMs with 7-12 billion parameters. We find that current open-weight LLMs have limitations in following codebooks zero-shot, but that supervised instruction tuning can substantially improve performance. Rather than suggesting the "best" LLM, our contribution lies in our codebook datasets, evaluation framework, and guidance for applied researchers who wish to implement their own codebook-LLM measurement projects.
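
The zero-shot evaluation and error-analysis stages described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-label `CODEBOOK`, the prompt wording, the helper names (`build_prompt`, `parse_label`, `evaluate`), and the hard-coded strings standing in for model output. None of it is taken from the paper's datasets or code.

```python
from collections import Counter

# Hypothetical mini-codebook (label -> definition); not from the paper's datasets.
CODEBOOK = {
    "PROTEST": "A public gathering that makes a political demand.",
    "RIOT": "A violent public disturbance by a crowd.",
    "OTHER": "Any text that fits no other category.",
}


def build_prompt(codebook, text):
    """Format the codebook definitions and one document into a zero-shot prompt."""
    lines = ["Classify the text using exactly one label from the codebook.", ""]
    for label, definition in codebook.items():
        lines.append(f"{label}: {definition}")
    lines += ["", f"Text: {text}", "Label:"]
    return "\n".join(lines)


def parse_label(raw_output, codebook):
    """Map raw model output to a codebook label, or None if it is not a legal label."""
    cleaned = raw_output.strip().upper()
    return cleaned if cleaned in codebook else None


def evaluate(gold, predicted):
    """Return accuracy and a Counter of (gold, predicted) pairs for the errors."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return correct / len(gold), errors


# Hard-coded strings stand in for LLM generations (assumption: no model is called here).
gold = ["PROTEST", "RIOT", "OTHER", "PROTEST"]
raw_outputs = ["protest", " RIOT\n", "Protest", "I think this is a march"]

predicted = [parse_label(raw, CODEBOOK) for raw in raw_outputs]
accuracy, errors = evaluate(gold, predicted)

print(build_prompt(CODEBOOK, "Thousands marched to demand electoral reform."))
print(f"accuracy = {accuracy:.2f}")
for (gold_label, pred_label), count in errors.items():
    print(f"gold={gold_label} predicted={pred_label} count={count}")
```

Treating output that is not a legal label as `None`, rather than guessing the nearest label, keeps format-following failures visible in the error counts separately from genuine misclassifications.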

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Political Science Concepts

Educational Accessibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Political Text Analysis

Enhanced Accuracy through Training
