Managing Cognitive Bias in Human Labeling Operations for Rare-Event AI: Evidence from a Field Experiment

πŸ“… 2026-03-11
πŸ€– AI Summary
This study addresses the high false-negative rates of AI models for rare events, which often stem from cognitive biases in human annotations that induce label bias and degrade model performance. Conducting a field experiment on the medical crowdsourcing platform DiagnosUs, the authors propose a framework that systematically mitigates annotation bias by integrating balanced positive-sample feedback, probabilistic labeling, and pipeline-level log-odds linear recalibration. By comparing binary and probabilistic annotation schemes and training convolutional neural networks on recalibrated labels, they show that the approach significantly reduces missed detections while improving both classification accuracy and prediction calibration. These gains remain robust in out-of-sample evaluations, demonstrating the method's generalizability and practical efficacy in real-world diagnostic settings.

πŸ“ Abstract
Many operational AI systems depend on large-scale human annotation to detect rare but consequential events (e.g., fraud, defects, and medical abnormalities). When positives are rare, the prevalence effect induces systematic cognitive biases that inflate misses and can propagate through the AI lifecycle via biased training labels. We analyze prior experimental evidence and run a field experiment on DiagnosUs, a medical crowdsourcing platform, in which we hold the true prevalence in the unlabeled stream fixed (20% blasts) while varying (i) the prevalence of positives in the gold-standard feedback stream (20% vs. 50%) and (ii) the response interface (binary labels vs. elicited probabilities). We then post-process probabilistic labels using a linear-in-log-odds recalibration approach at the worker and crowd levels, and train convolutional neural networks on the resulting labels. Balanced feedback and probabilistic elicitation reduce rare-event misses, and pipeline-level recalibration substantially improves both classification performance and probabilistic calibration; these gains carry through to downstream CNN reliability out of sample.
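The linear-in-log-odds recalibration the abstract describes can be illustrated with a minimal sketch: an elicited probability is mapped to log-odds, transformed linearly, and mapped back. The coefficients `a` (shift) and `b` (scale) would be estimated from gold-standard feedback items at the worker or crowd level; the fitting step and the helper names below are illustrative assumptions, not the paper's implementation.

```python
import math

def logit(p: float) -> float:
    """Map a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def recalibrate(p: float, a: float, b: float) -> float:
    """Linear-in-log-odds recalibration of an elicited probability.

    a shifts the log-odds (corrects systematic over/under-reporting of
    positives, e.g. from the prevalence effect); b rescales them
    (corrects over/under-confidence). a = 0, b = 1 leaves p unchanged.
    """
    return sigmoid(a + b * logit(p))

# Example: a worker who under-reports rare positives (a > 0 pushes
# probabilities upward; b < 1 shrinks overconfident extremes).
raw = [0.05, 0.20, 0.50, 0.90]
adjusted = [recalibrate(p, a=0.8, b=0.7) for p in raw]
```

In a pipeline, the recalibrated probabilities (rather than the raw elicited ones) would serve as soft training labels for the downstream CNN.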
Problem

Research questions and friction points this paper is trying to address.

cognitive bias
rare-event detection
human labeling
prevalence effect
AI reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

prevalence effect
probabilistic elicitation
label recalibration
rare-event detection
human-in-the-loop AI
Gunnar P. Epping
Cognitive Science Program, Indiana University, Indiana, USA; Department of Psychological and Brain Sciences, Indiana University, Indiana, USA; Centaur Labs, Massachusetts, USA
Andrew Caplin
New York University
Erik Duhaime
Centaur Labs, Massachusetts, USA
William R. Holmes
Mathematics and Cognitive Science
Mathematical Biology / Mathematical Psychology
Daniel Martin
University of California, Santa Barbara
Behavioral Economics; Cognitive Economics; Experimental Economics; Humans and AI
Jennifer S. Trueblood
Cognitive Science Program, Indiana University, Indiana, USA; Department of Psychological and Brain Sciences, Indiana University, Indiana, USA