🤖 AI Summary
To address fairness concerns in large language models (LLMs), which inherit biases from their training data (intersectional biases being especially elusive), this paper proposes GenFair, a metamorphic fairness testing framework grounded in classical software testing theory. Methodologically, it combines equivalence class partitioning, mutation operators, and boundary value analysis to generate source test cases with high diversity, realism, and intersectionality, and it introduces tone-aware metamorphic relations to detect fairness violations in model responses. Experiments on GPT-4.0 and LLaMA-3.0 show fault detection rates of 0.73 and 0.69, respectively, significantly outperforming the baseline approaches. The framework also achieves the best syntactic and semantic diversity scores (10.06 and 76.68) and maintains higher response coherence across all metrics.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in critical domains, yet they often exhibit biases inherited from training data, leading to fairness concerns. This work focuses on the problem of effectively detecting fairness violations, especially intersectional biases that are often missed by existing template-based and grammar-based testing methods. Previous approaches, such as CheckList and ASTRAEA, provide structured or grammar-driven test generation but struggle with low test diversity and limited sensitivity to complex demographic interactions. To address these limitations, we propose GenFair, a metamorphic fairness testing framework that systematically generates source test cases using equivalence partitioning, mutation operators, and boundary value analysis. GenFair improves fairness testing by generating linguistically diverse, realistic, and intersectional test cases. It applies metamorphic relations (MRs) to derive follow-up cases and detects fairness violations via tone-based comparisons between source and follow-up responses. In experiments with GPT-4.0 and LLaMA-3.0, GenFair outperformed two baseline methods. It achieved a fault detection rate (FDR) of 0.73 (GPT-4.0) and 0.69 (LLaMA-3.0), compared to 0.54/0.51 for template-based testing and 0.39/0.36 for ASTRAEA. GenFair also showed the highest test case diversity (syntactic: 10.06, semantic: 76.68) and strong coherence (syntactic: 291.32, semantic: 0.7043), outperforming both baselines. These results demonstrate the effectiveness of GenFair in uncovering nuanced fairness violations. The proposed method offers a scalable and automated solution for fairness testing and contributes to building more equitable LLMs.
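To make the core idea concrete, the following is a minimal sketch of a tone-based metamorphic fairness check in the spirit described above: a source prompt is mutated by swapping a demographic attribute (the metamorphic relation), and a fairness violation is flagged when the tone of the model's response changes. The `toy_llm`, the lexicon-based `tone_score`, and the threshold are illustrative stand-ins, not GenFair's actual implementation.

```python
import re

# Tiny sentiment lexicons; a real tone analyzer would be far richer.
POSITIVE = {"talented", "reliable", "capable", "dedicated"}
NEGATIVE = {"lazy", "unreliable", "incapable", "risky"}

def tone_score(response: str) -> int:
    """Crude lexicon-based tone: positive minus negative word counts."""
    words = re.findall(r"[a-z]+", response.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def swap_attribute(prompt: str, source_attr: str, follow_up_attr: str) -> str:
    """MR: replacing one demographic attribute with another should not
    change the tone of the model's response."""
    return prompt.replace(source_attr, follow_up_attr)

def fairness_violation(llm, prompt: str, attr_a: str, attr_b: str,
                       threshold: int = 1) -> bool:
    """Flag a violation when source and follow-up responses diverge in tone."""
    src_response = llm(prompt)
    fup_response = llm(swap_attribute(prompt, attr_a, attr_b))
    return abs(tone_score(src_response) - tone_score(fup_response)) >= threshold

# Toy (deliberately biased) model standing in for a real LLM call:
def toy_llm(prompt: str) -> str:
    if "older" in prompt:
        return "They are often unreliable."
    return "They are reliable."

prompt = "Describe the work habits of the older female engineer."
print(fairness_violation(toy_llm, prompt, "older female", "young male"))  # → True
```

An intersectional test case, as above, perturbs a combination of attributes (age and gender together) rather than a single one, which is where template-based generators tend to lack coverage.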