🤖 AI Summary
To address fairness concerns in large language models (LLMs), which inherit biases from their training data (intersectional biases being especially elusive), this paper proposes GenFair, a metamorphic fairness testing framework grounded in classical software testing theory. Methodologically, it combines equivalence class partitioning, mutation operators, and boundary value analysis to generate source test cases with high diversity, realism, and intersectionality, and it introduces tone-aware metamorphic relations to detect fairness violations in model responses. Experiments on GPT-4.0 and LLaMA-3.0 show fault detection rates of 0.73 and 0.69, respectively, significantly outperforming the baseline approaches. The framework also achieves the best syntactic and semantic diversity scores (10.06 and 76.68) and maintains higher response coherence across all metrics.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in critical domains, yet they often exhibit biases inherited from training data, leading to fairness concerns. This work focuses on the problem of effectively detecting fairness violations, especially intersectional biases that are often missed by existing template-based and grammar-based testing methods. Previous approaches, such as CheckList and ASTRAEA, provide structured or grammar-driven test generation but struggle with low test diversity and limited sensitivity to complex demographic interactions. To address these limitations, we propose GenFair, a metamorphic fairness testing framework that systematically generates source test cases using equivalence partitioning, mutation operators, and boundary value analysis. GenFair improves fairness testing by generating linguistically diverse, realistic, and intersectional test cases. It applies metamorphic relations (MRs) to derive follow-up cases and detects fairness violations via tone-based comparisons between source and follow-up responses. In experiments with GPT-4.0 and LLaMA-3.0, GenFair outperformed two baseline methods. It achieved a fault detection rate (FDR) of 0.73 (GPT-4.0) and 0.69 (LLaMA-3.0), compared to 0.54/0.51 for template-based testing and 0.39/0.36 for ASTRAEA. GenFair also showed the highest test case diversity (syntactic: 10.06, semantic: 76.68) and strong coherence (syntactic: 291.32, semantic: 0.7043), outperforming both baselines. These results demonstrate the effectiveness of GenFair in uncovering nuanced fairness violations. The proposed method offers a scalable and automated solution for fairness testing and contributes to building more equitable LLMs.
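To make the core idea concrete, the following is a minimal sketch of a tone-based metamorphic fairness check in the spirit described above: a source prompt is mutated by swapping a demographic attribute (the metamorphic relation), and a fairness violation is flagged when the tone of the model's response changes. The `toy_llm`, the lexicon-based `tone_score`, and the threshold are illustrative stand-ins, not GenFair's actual implementation.

```python
import re

# Tiny sentiment lexicons; a real tone analyzer would be far richer.
POSITIVE = {"talented", "reliable", "capable", "dedicated"}
NEGATIVE = {"lazy", "unreliable", "incapable", "risky"}

def tone_score(response: str) -> int:
    """Crude lexicon-based tone: positive minus negative word counts."""
    words = re.findall(r"[a-z]+", response.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def swap_attribute(prompt: str, source_attr: str, follow_up_attr: str) -> str:
    """MR: replacing one demographic attribute with another should not
    change the tone of the model's response."""
    return prompt.replace(source_attr, follow_up_attr)

def fairness_violation(llm, prompt: str, attr_a: str, attr_b: str,
                       threshold: int = 1) -> bool:
    """Flag a violation when source and follow-up responses diverge in tone."""
    src_response = llm(prompt)
    fup_response = llm(swap_attribute(prompt, attr_a, attr_b))
    return abs(tone_score(src_response) - tone_score(fup_response)) >= threshold

# Toy (deliberately biased) model standing in for a real LLM call:
def toy_llm(prompt: str) -> str:
    if "older" in prompt:
        return "They are often unreliable."
    return "They are reliable."

prompt = "Describe the work habits of the older female engineer."
print(fairness_violation(toy_llm, prompt, "older female", "young male"))  # → True
```

An intersectional test case, as above, perturbs a combination of attributes (age and gender together) rather than a single one, which is where template-based generators tend to lack coverage.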