Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Social stigma toward marginalized groups may systematically bias large language model (LLM) outputs, yet the psychological mechanisms linking stigma attributes to LLM bias remain uncharacterized. Method: We systematically analyze associations between six core social stigma psychology dimensions—e.g., dangerousness, concealability, origin—and LLM bias across 93 stigmatized groups, using the SocialStigmaQA benchmark and human-annotated ground truth. We evaluate three open-weight models (Granite-3.0-8B, Llama-3.1-8B, Mistral-7B) and their corresponding guardrail models (e.g., Granite Guardian). Contribution/Results: We identify a strong positive correlation between stigma attributes—particularly dangerousness—and LLM bias magnitude: bias rates reach 60% for high-dangerousness groups (e.g., gang members, people living with HIV), versus only 11% for sociodemographic groups. Guardrail models reduce bias by only 7.2% on average (max 10.4%), fail to detect bias intent, preserve core stigma-driving features, and miss >50% of biased prompts. This work establishes a psychological foundation and empirical evidence for modeling and governing LLM bias at its root.

📝 Abstract
Large language models (LLMs) have been shown to exhibit social bias; however, bias toward non-protected stigmatized identities remains understudied. Furthermore, it is unknown which social features of stigmas are associated with bias in LLM outputs. The psychology literature has shown that stigmas share six social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate whether human and LLM ratings of these stigma features, along with prompt style and type of stigma, affect bias toward stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example, deciding whether to recommend them for an internship. We find that stigmas rated by humans as highly perilous (e.g., being a gang member or having HIV) yield the most biased outputs on SocialStigmaQA prompts (60% of outputs across all models), while sociodemographic stigmas (e.g., Asian-American identity or old age) yield the fewest biased outputs (11%). We test whether the number of biased outputs can be decreased using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API), models meant to identify harmful input. We find that bias decreases significantly, by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with a significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the biased intent of prompts. This work has implications for using LLMs in scenarios involving stigmatized groups, and we suggest future work toward improving guardrail models for bias mitigation.
Problem

Research questions and friction points this paper is trying to address.

Identifying bias features in LLMs for stigmatized groups
Measuring bias across 93 stigmatized groups in three LLMs
Evaluating guardrail models' effectiveness in reducing bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing bias in LLMs using stigma social features
Measuring bias across 93 groups with SocialStigmaQA benchmark
Testing guardrail models for bias mitigation effectiveness