🤖 AI Summary
This work systematically investigates implicit societal biases, such as gender and racial bias, in large language models (LLMs) during code generation, addressing critical fairness and safety concerns. Method: We introduce FairCode, the first benchmark for assessing societal bias in programming contexts, covering two tasks: function implementation and test-case generation. Our semantics-aware, scenario-diverse evaluation framework defines FairScore, a quantitative bias metric, and uses prompt engineering to construct bias-sensitive coding tasks, validated via human annotation and automated assessment across multiple LLMs. Contribution/Results: Empirical evaluation of mainstream open- and closed-source LLMs reveals pervasive societal bias in all models. To foster reproducibility and community advancement, we publicly release the FairCode dataset and evaluation toolkit, establishing a foundation for research on fair code generation.
📝 Abstract
Large language models (LLMs) have demonstrated significant capability in code generation, drawing increasing attention to the evaluation of the quality and safety of their outputs. However, research on bias in code generation remains limited. Existing studies typically assess bias either by applying malicious prompts or by reusing tasks and datasets designed for discriminative models. Given that LLMs are often aligned with human values and that prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed to evaluate code models. In this study, we introduce FairCode, a novel benchmark for evaluating bias in code generation. FairCode comprises two tasks: function implementation and test case generation, each evaluating social bias through diverse scenarios. Additionally, we propose a new metric, FairScore, to assess model performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit bias. The code is available at https://github.com/YongkDu/FairCode.