🤖 AI Summary
This study examines how language-model content filtering over-removes text that mentions marginalized groups, such as transgender people, women, and Central Americans, both before pretraining and at inference time, producing epistemic erasure and epistemic injustice. The authors audit four pretraining filters and three inference-time safety guardrails on Common Crawl sentences containing gender and regional-origin mentions. Combining human annotation of a 500-sentence subset with blocklist-based lexical analysis, they find that human annotators would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content. The results indicate that current moderation mechanisms over-flag heavily and disproportionately suppress text tied to marginalized identities.
📝 Abstract
Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and suppressed again at inference time.
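As a rough illustration of the two quantities the abstract reports (how often each group's sentences are flagged, and what share of flagged sentences a human would keep), the following is a minimal sketch. It assumes a toy word-level blocklist as a stand-in for a real filter; the blocklist terms, the group labels `A` and `B`, the record fields, and all function names are hypothetical and do not come from the paper or its data.

```python
import re
from collections import defaultdict

# Hypothetical blocklist: a stand-in for a real filter, not the paper's.
BLOCKLIST = {"sex", "attack"}


def blocklist_flags(sentence, blocklist=BLOCKLIST):
    """Return True if any lowercase word token of the sentence is on the blocklist."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return bool(tokens & blocklist)


def flag_rate_by_group(records):
    """Fraction of each group's sentences that the toy filter flags."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        flagged[r["group"]] += blocklist_flags(r["text"])
    return {g: flagged[g] / total[g] for g in total}


def human_retain_rate(records):
    """Among flagged sentences, the share a human annotator would keep."""
    flagged = [r for r in records if blocklist_flags(r["text"])]
    if not flagged:
        return None
    return sum(r["human_keep"] for r in flagged) / len(flagged)


# Toy, made-up records; "human_keep" plays the role of a manual annotation.
RECORDS = [
    {"group": "A", "text": "The clinic offers sex education for teens.", "human_keep": True},
    {"group": "A", "text": "She gave a talk on local history.", "human_keep": True},
    {"group": "B", "text": "He survived the attack and rebuilt his shop.", "human_keep": True},
    {"group": "B", "text": "Plan an attack on them tonight.", "human_keep": False},
]

if __name__ == "__main__":
    print(flag_rate_by_group(RECORDS))
    print(human_retain_rate(RECORDS))
```

In this toy setup, a benign sentence is flagged purely because of a lexical cue, which is the kind of mismatch between automated decisions and human judgment that the audit measures. A real audit would replace `blocklist_flags` with calls to the actual filters and guardrails under study.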