AI Summary
Existing AI-generated text detectors are widely deployed in educational and professional settings, yet they lack systematic evaluation for sociolinguistic bias, particularly against English language learners (ELLs), speakers of non-mainstream dialects, and individuals with lower educational attainment.
Method: We introduce BAID, the first benchmark for auditing detector bias, encompassing seven sociolinguistic dimensions and over 200,000 real and controllably synthesized samples. We propose a subgroup-aware text synthesis method that injects stylistic biases while preserving semantic content, and design a multidimensional, scalable bias auditing framework.
Contribution/Results: Evaluating four state-of-the-art open-source detectors reveals statistically significant (p < 0.01) average recall drops of 12.7–38.4 percentage points for ELLs, non-mainstream dialect speakers, and low-education groups. We publicly release both the audit toolkit and the BAID dataset to advance standardized fairness assessment in AI-generated text detection.
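The core of the audit described above is comparing detector recall (the fraction of AI-generated samples correctly flagged) across sociolinguistic subgroups. A minimal sketch of that computation, assuming a hypothetical input of (subgroup, detector-prediction) pairs over AI-generated samples only; function names and the toy data are illustrative, not the BAID toolkit's actual API:

```python
from collections import defaultdict

def subgroup_recall(samples):
    """samples: iterable of (subgroup, predicted_ai) pairs,
    where predicted_ai is 1 if the detector flagged the
    AI-generated sample, else 0. Returns recall per subgroup."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, predicted_ai in samples:
        totals[group] += 1
        hits[group] += int(predicted_ai)
    return {g: hits[g] / totals[g] for g in totals}

def recall_gap(recalls, reference_group):
    """Drop in recall, in percentage points, of each subgroup
    relative to a chosen reference group."""
    ref = recalls[reference_group]
    return {g: round(100 * (ref - r), 1) for g, r in recalls.items()}

# Toy example: the detector catches 3/4 "native"-styled samples
# but only 1/4 ELL-styled samples.
samples = [
    ("native", 1), ("native", 1), ("native", 1), ("native", 0),
    ("ELL", 1), ("ELL", 0), ("ELL", 0), ("ELL", 0),
]
recalls = subgroup_recall(samples)    # {'native': 0.75, 'ELL': 0.25}
gaps = recall_gap(recalls, "native")  # ELL drops 50.0 percentage points
```

A per-subgroup gap like this, aggregated over many samples and tested for significance, is the kind of disparity the reported 12.7–38.4 point recall drops quantify.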
Abstract
AI-generated text detectors have recently gained adoption in educational and professional contexts. Prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs); however, there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose BAID, a comprehensive framework for evaluating AI detectors across multiple types of bias. As part of the framework, we introduce over 200k samples spanning seven major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. We also generate synthetic versions of each sample with carefully crafted prompts that preserve the original content while reflecting subgroup-specific writing styles. Using this benchmark, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.