🤖 AI Summary
This study examines the biases that large language models exhibit in socially sensitive scenarios involving intersecting identities such as race and gender. It systematically evaluates six LLMs on ambiguous and disambiguated contexts drawn from two benchmark datasets, combining bias scores, subgroup fairness metrics, accuracy, and response consistency across repeated runs and across negative and non-negative question polarities. In ambiguous contexts the models generally perform well, but the resulting non-unknown predictions are too sparse for fairness metrics to be informative. In disambiguated contexts, accuracy depends on stereotype alignment: models are more accurate when the correct answer reinforces a stereotype than when it contradicts it, and the effect is strongest at race-gender intersections. No evaluated model behaves consistently reliably or fairly across all intersectional settings. These findings show that apparent competence can mask underlying bias, and that fairness evaluation needs to go beyond accuracy alone.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
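To make two of the quantities above concrete, the following is a minimal sketch of how an accuracy gap by stereotype alignment and a per-item consistency score could be computed from repeated runs in disambiguated contexts. The abstract does not give the paper's exact metric definitions, so the `Item` record, the function names, and both formulas (gap as a difference of accuracies, consistency as the share of runs agreeing with the most frequent answer) are illustrative assumptions and not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    """One disambiguated question (hypothetical record layout)."""
    answers: list[str]   # model answer from each repeated run
    correct: str         # gold answer given the disambiguating context
    stereotyped: str     # answer that would match the stereotype


def accuracy(items: list[Item]) -> float:
    """Share of correct answers, pooled over all runs of all items."""
    total = sum(len(it.answers) for it in items)
    hits = sum(a == it.correct for it in items for a in it.answers)
    return hits / total if total else float("nan")


def alignment_gap(items: list[Item]) -> float:
    """Accuracy on stereotype-aligned items minus accuracy on
    stereotype-conflicting items (assumed definition).
    A positive value means the model does better when the correct
    answer reinforces the stereotype."""
    aligned = [it for it in items if it.correct == it.stereotyped]
    conflicting = [it for it in items if it.correct != it.stereotyped]
    return accuracy(aligned) - accuracy(conflicting)


def consistency(item: Item) -> float:
    """Share of runs that agree with the most frequent answer
    (assumed definition); expects at least one run."""
    counts = Counter(item.answers)
    return counts.most_common(1)[0][1] / len(item.answers)


items = [
    # Correct answer matches the stereotype; model is right in all 3 runs.
    Item(answers=["A", "A", "A"], correct="A", stereotyped="A"),
    # Correct answer contradicts the stereotype; model is right in 2 of 3 runs.
    Item(answers=["A", "B", "B"], correct="B", stereotyped="A"),
]
gap = alignment_gap(items)
per_item_consistency = [consistency(it) for it in items]
```

Reporting the gap and the consistency scores separately for each intersectional subgroup, rather than pooling them, is what would surface the uneven outcome distributions the abstract describes.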