Do LLMs have a Gender (Entropy) Bias?

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies and empirically validates “entropy bias”: a systematic disparity in the information content of large language model (LLM) responses to real-world questions posed by male versus female users. Method: the authors construct RealWorldQuestioning, a benchmark of user questions spanning education, jobs, personal finance, and health, and combine information-theoretic entropy quantification, LLM-as-judge evaluation, cross-model comparison, and an iterative gender-aware response fusion prompting strategy. The analysis reveals substantial gendered information imbalance at the individual-question level that is often obscured by category-level averaging. Contribution/Results: a lightweight, prompt-level debiasing method that, across diverse LLMs, produces responses more informative than either single-gender response in 78% of cases and a balanced integration in the remainder, establishing entropy bias as a measurable, addressable dimension of gender bias in LLMs.
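The summary does not spell out the entropy measure, so the sketch below assumes a simple word-level Shannon entropy over each response's token frequencies; the paper's exact formulation may differ, and `response_entropy` and `entropy_gap` are illustrative names, not the authors' code.

```python
import math
from collections import Counter

def response_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word-frequency distribution of text."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    total = len(tokens)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(tokens).values()
    )

def entropy_gap(response_male: str, response_female: str) -> float:
    """Signed per-question gap: positive means the response generated for a
    male user carried more (word-level) entropy than the female variant."""
    return response_entropy(response_male) - response_entropy(response_female)
```

Under this reading, "entropy bias" for a question is simply a nonzero `entropy_gap` between the two gendered responses to it.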

📝 Abstract
We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We define and study entropy bias: a discrepancy in the amount of information an LLM generates in response to real questions users have asked, depending on the user's stated gender. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively, using ChatGPT-4o as an "LLM-as-judge". Our analyses (metric-based comparisons and "LLM-as-judge" evaluation) suggest that there is no significant bias in LLM responses for men and women at the category level. However, at a finer granularity (the individual question level), there are substantial differences in LLM responses for men and women in the majority of cases, which often "cancel" each other out because some responses are better for males and others for females. This is still a concern, since typical users of these tools often ask only a specific question rather than several varied ones in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. Our approach demonstrates that a simple, prompt-based debiasing strategy can effectively debias LLM outputs, producing responses with higher information content than both gendered variants in 78% of the cases, and consistently achieving a balanced integration in the remaining cases.
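A minimal sketch of how the iterative merging described above could be driven from a prompt loop. Both `call_llm` (a completion function) and `informativeness` (a scoring function, e.g. the `response_entropy` sketch above) are hypothetical parameters; the paper's actual fusion prompts and stopping rule are not given here and may differ.

```python
def fuse_responses(call_llm, informativeness, question,
                   resp_male, resp_female, max_rounds=3):
    """Iteratively prompt the model to merge the two gendered responses,
    stopping once the fusion scores at least as high as both inputs."""
    target = max(informativeness(resp_male), informativeness(resp_female))
    fused = resp_male
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n\n"
            f"Answer A:\n{fused}\n\n"
            f"Answer B:\n{resp_female}\n\n"
            "Merge the two answers into a single response that preserves "
            "every useful point from both without inventing new content."
        )
        fused = call_llm(prompt)
        if informativeness(fused) >= target:
            break
    return fused
```

The abstract's 78% figure would correspond, in this sketch, to the fraction of questions where the loop exits with `informativeness(fused)` strictly above both gendered inputs.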
Problem

Research questions and friction points this paper is trying to address.

Investigates gender entropy bias in popular LLMs
Analyzes bias at the individual-question level, where category-level averages can mask it (see the toy illustration after this list)
Proposes prompt-based debiasing for balanced outputs
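A toy illustration of the masking effect named above: signed per-question gaps cancel in a category average, while absolute gaps do not. The numbers are invented for illustration, not taken from the paper.

```python
gaps = [+0.9, -1.1, +0.8, -0.7]  # hypothetical per-question entropy gaps (male minus female)

category_mean = sum(gaps) / len(gaps)                 # -0.03 bits: looks unbiased
mean_abs_gap = sum(abs(g) for g in gaps) / len(gaps)  # 0.88 bits: large per-question disparity

print(f"category-level mean gap:        {category_mean:+.2f} bits")
print(f"mean absolute per-question gap: {mean_abs_gap:.2f} bits")
```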
Innovation

Methods, ideas, or system contributions that make the work stand out.

New benchmark dataset, RealWorldQuestioning, for bias testing
Defines and studies entropy bias in LLM responses, evaluated with ChatGPT-4o as an LLM-as-judge (sketched after this list)
Simple prompt-based debiasing strategy improves information content
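A hedged sketch of the LLM-as-judge comparison mentioned above (the abstract names ChatGPT-4o as the judge). The judging prompt and rubric here are assumptions, and `call_judge` is a hypothetical function returning the judge model's completion.

```python
def judge_pair(call_judge, question, resp_male, resp_female):
    """Ask the judge model which answer is more informative; returns 'A', 'B',
    or 'TIE'. Answer A is the male-addressed response in this sketch."""
    prompt = (
        "Compare two answers to the same question and decide which is more "
        "informative.\n\n"
        f"Question: {question}\n\n"
        f"Answer A:\n{resp_male}\n\n"
        f"Answer B:\n{resp_female}\n\n"
        "Reply with exactly one word: A, B, or TIE."
    )
    verdict = call_judge(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```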