🤖 AI Summary
This study investigates whether large language model (LLM)-driven AI coding agents diminish methodological diversity in social science analysis or amplify the analytic flexibility through which researchers reach motivated conclusions. By executing 20 independent runs each of Claude Code and Codex on a dataset concerning immigration and social policy, and comparing their outputs against a human multi-analyst benchmark, the research provides the first systematic evaluation that disentangles method design from conclusion interpretation. The findings reveal that AI agents exhibit methodological diversity in model specification comparable to, or even exceeding, that of humans, with effect estimates broadly aligning with human consensus. However, their interpretive conclusions are highly sensitive to how the prompt frames the task: under an explicit confirmatory prompt, Claude Code's support for a given hypothesis swings from 10% to 90% while its coefficient distribution stays essentially unchanged, indicating that bias arises primarily during interpretation rather than estimation.
📝 Abstract
The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy dataset and comparing them against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.
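To make the verdict-layer distinction concrete, the following is a minimal sketch of how a decision rule maps a fixed set of estimates to a verdict, and how omitting the rule differs from softening it. Everything here is an illustrative assumption rather than the paper's procedure: the `Estimate` type, the rule functions, the thresholds, and the toy numbers are hypothetical and are not taken from the study's data or results.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Estimate:
    """One model run: a coefficient and its p-value (toy values only)."""
    coef: float
    p_value: float


def significance_rule(e: Estimate, alpha: float) -> bool:
    # Hypothetical decision rule: "supported" requires a negative
    # coefficient that is also significant at level alpha.
    return e.coef < 0 and e.p_value < alpha


def strict_rule(e: Estimate) -> bool:
    return significance_rule(e, alpha=0.05)


def softened_rule(e: Estimate) -> bool:
    # Rule softening: same rule, looser threshold.
    return significance_rule(e, alpha=0.20)


def omitted_rule(e: Estimate) -> bool:
    # Rule omission: the significance criterion is dropped entirely,
    # so the sign of the coefficient alone decides the verdict.
    return e.coef < 0


def support_share(estimates: List[Estimate],
                  rule: Callable[[Estimate], bool]) -> float:
    """Fraction of runs whose verdict is 'hypothesis supported'."""
    return sum(rule(e) for e in estimates) / len(estimates)


# Made-up estimates, identical across all three rules (not study data).
TOY = [
    Estimate(-0.02, 0.60),
    Estimate(-0.01, 0.80),
    Estimate(-0.05, 0.03),
    Estimate(0.01, 0.70),
]

if __name__ == "__main__":
    for name, rule in [("strict", strict_rule),
                       ("softened", softened_rule),
                       ("omitted", omitted_rule)]:
        print(f"{name}: {support_share(TOY, rule):.2f}")
```

In this toy setup the estimates never change, yet the share of supportive verdicts moves only when the rule is dropped, which is the sense in which bias can sit in interpretation rather than estimation.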