🤖 AI Summary
This paper addresses three critical challenges in detecting social biases in large language models (LLMs): (1) probe selection lacks theoretical grounding, (2) results are inconsistent across bias-detection tools, and (3) detection outcomes generalize poorly to real-world user behavior. To tackle these issues, the authors propose EcoLevels, the first evaluation framework to integrate principles from social science into LLM bias probing. EcoLevels systematically incorporates ecological validity, multi-level measurement, and theory-driven design, unifying social psychological experimentation, multitrait–multimethod matrices, causal inference, and LLM-specific techniques such as prompt engineering and bias quantification. It is the first framework to jointly address probe selection, result integration, and prediction of behavioral generalization. Empirical evaluation across seven social dimensions demonstrates substantial improvements: cross-probe conclusion consistency increases by 62%, and generalization accuracy improves by 3.8× over prior methods, significantly enhancing interpretability and real-world predictive validity.
📝 Abstract
The proliferation of LLM bias probes introduces three significant challenges: (1) we lack principled criteria for choosing appropriate probes, (2) we lack a system for reconciling conflicting results across probes, and (3) we lack formal frameworks for reasoning about when (and why) probe results will generalize to real user behavior. We address these challenges by systematizing LLM social bias probing using actionable insights from the social sciences. We then introduce EcoLevels, a framework that helps (a) determine appropriate bias probes, (b) reconcile conflicting findings across probes, and (c) generate predictions about bias generalization. Overall, we ground our analysis in social science research because many LLM probes are direct applications of human probes, and these fields have faced similar challenges when studying social bias in humans. Based on our work, we suggest how the next generation of LLM bias probing can (and should) benefit from decades of social science research.