🤖 AI Summary
Large language models (LLMs) inherit societal biases from pretraining corpora (e.g., Common Crawl), leading to discriminatory outputs. To address this, we propose the first unified framework for bias quantification that jointly models *protected attribute detection* and *fine-grained attitude classification*. Our method combines rule-based heuristics with fine-tuned BERT to identify attributes such as gender and race; employs a four-class regard classifier (positive/negative/neutral/other) to assess linguistic sentiment toward each attribute; and introduces bias intensity-weighted aggregation for scalable, interpretable diagnosis. Evaluated on a Common Crawl subset, our framework achieves 92.3% F1-score for attribute detection and 89.7% accuracy for attitude classification, uncovering systematic biases—e.g., frequent co-occurrence of “female” with “emotional”. Furthermore, applying our proposed debiasing strategies reduces bias in downstream tasks by 37%, demonstrating practical efficacy.
📝 Abstract
Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.