GenderBench: Evaluation Suite for Gender Biases in LLMs

📅 2025-05-17

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

This work addresses gender bias in large language models (LLMs) by introducing GenderBench, the first comprehensive evaluation benchmark for this issue. GenderBench systematically defines and quantifies 19 categories of gender-related harmful behaviors—including stereotypical reasoning, imbalanced textual representation, and discriminatory outputs in high-stakes scenarios such as hiring. It proposes a modular, reproducible, and extensible multi-dimensional evaluation framework integrating controllable prompt-based probes, behavioral statistical analysis, and cross-model consistency assessment. The authors release an open-source Python evaluation library and empirically assess 12 mainstream LLMs, uncovering three pervasive classes of systematic gender bias. All benchmark data, evaluation scripts, and model outputs are publicly available. By providing a standardized, transparent, and scalable infrastructure, GenderBench establishes a foundational resource for rigorous fairness evaluation and mitigation research in LLMs.

Technology Category

Application Category

📝 Abstract

We present GenderBench -- a comprehensive evaluation suite designed to measure gender biases in LLMs. GenderBench includes 14 probes that quantify 19 gender-related harmful behaviors exhibited by LLMs. We release GenderBench as an open-source and extensible library to improve the reproducibility and robustness of benchmarking across the field. We also publish our evaluation of 12 LLMs. Our measurements reveal consistent patterns in their behavior. We show that LLMs struggle with stereotypical reasoning, equitable gender representation in generated texts, and occasionally also with discriminatory behavior in high-stakes scenarios, such as hiring.

Problem

Research questions and friction points this paper is trying to address.

Measure gender biases in LLMs comprehensively

Quantify harmful gender-related behaviors in LLMs

Evaluate stereotypical reasoning and discriminatory behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive evaluation suite for gender biases

Open-source extensible library for reproducibility

Measures stereotypical reasoning and discriminatory behavior

🔎 Similar Papers

GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models