Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

📅 2026-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the persistence of internalized gender bias in large language models (LLMs) despite current alignment techniques, which primarily mitigate only surface-level, explicit bias in model outputs. The authors propose a unified analytical framework that uses shared neutral prompts to simultaneously probe intrinsic gender information encoded in internal representations and explicit bias manifested in generated text. Under this unified protocol, they reveal, for the first time, a consistent correlation between internal and external biases. Their findings demonstrate that while alignment methods suppress overt bias in outputs, latent internal bias remains intact and can be reactivated by adversarial prompts. Evaluations across structured benchmarks and realistic scenarios such as story generation further show that existing supervised fine-tuning-based alignment strategies merely mask, rather than eliminate, encoded biases, with limited generalization to complex, real-world applications.

📝 Abstract
During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while supervised fine-tuning indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.
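The unified protocol the abstract describes, probing internal representations for latent gender information while scoring expressed bias on the same neutral prompts, can be illustrated with a minimal sketch. Everything below is illustrative, not the paper's code: synthetic vectors stand in for real LLM hidden states, a nearest-centroid classifier plays the role of the intrinsic probe, and "alignment" is simulated as an output-level suppression that leaves hidden states untouched.

```python
# Illustrative sketch only (synthetic data, not the paper's method):
# the same "neutral prompts" yield (a) hidden states, probed for latent
# gender information, and (b) an output-level expressed-bias score.
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 32
gender = rng.integers(0, 2, size=n)        # latent attribute per neutral prompt
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)     # unit "gender direction"

# Hidden states: isotropic noise plus a small gender-aligned component.
hidden = rng.normal(size=(n, d)) + 0.8 * np.outer(2 * gender - 1, direction)

# Intrinsic bias: can a simple nearest-centroid probe recover gender
# from the hidden states? Train on 300 prompts, test on the rest.
tr, te = slice(0, 300), slice(300, None)
mu0 = hidden[tr][gender[tr] == 0].mean(axis=0)
mu1 = hidden[tr][gender[tr] == 1].mean(axis=0)
pred = (np.linalg.norm(hidden[te] - mu1, axis=1)
        < np.linalg.norm(hidden[te] - mu0, axis=1)).astype(int)
intrinsic = float((pred == gender[te]).mean())

# Expressed bias: a toy output-level score, before and after a simulated
# alignment step that zeroes the output signal without touching hidden states.
expressed_raw = float(np.abs(hidden @ direction).mean())
expressed_aligned = 0.0

print(f"probe accuracy (intrinsic): {intrinsic:.2f}")
print(f"expressed bias raw -> aligned: {expressed_raw:.2f} -> {expressed_aligned:.2f}")
```

In this toy setup the probe accuracy stays well above chance even when the expressed-bias score is driven to zero, mirroring the paper's finding that output-level alignment can mask bias that remains encoded internally.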
Problem

Research questions and friction points this paper is trying to address.

gender bias
large language models
alignment
intrinsic bias
extrinsic bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

gender bias
large language models
alignment
intrinsic bias
adversarial prompting
Nour Bouchouchi
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
Thibault Laugel
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France; AXA, Paris, France
Xavier Renard
LIP6
Artificial intelligence · Machine learning
Christophe Marsala
LIP6 - Sorbonne Université
AI · Fuzzy logic · Machine Learning
Marie-Jeanne Lesot
LIP6, Sorbonne Université
Marcin Detyniecki
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France; AXA, Paris, France; Polish Academy of Science, IBS PAN, Warsaw, Poland