🤖 AI Summary
This study addresses implicit gender bias in news corpora with an actor-level fairness analysis pipeline that detects and mitigates asymmetries in sentiment, syntactic agency, and quotation styles among the actors mentioned in a text. Building on prior discourse-aware fairness analysis, it introduces new actor-level metrics for these three dimensions and supports both diagnostic corpus analysis and exclusion-based balancing. Applied to the *taz2024full* corpus, a collection of German newspaper articles from 1980 to 2024, the approach substantially improves gender balance across multiple linguistic dimensions. Surface-level asymmetries can be mitigated through filtering and rebalancing, but subtler forms of bias persist, particularly in sentiment and framing. The authors release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
📝 Abstract
Large language models are increasingly shaping digital communication, yet their outputs often reflect structural gender imbalances that originate from their training data. This paper presents an extended actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. Building on prior work in discourse-aware fairness analysis, we introduce new actor-level metrics that capture asymmetries in sentiment, syntactic agency, and quotation styles. The pipeline supports both diagnostic corpus analysis and exclusion-based balancing, enabling the construction of fairer corpora. We apply our approach to the taz2024full corpus of German newspaper articles from 1980 to 2024, demonstrating substantial improvements in gender balance across multiple linguistic dimensions. Our results show that while surface-level asymmetries can be mitigated through filtering and rebalancing, subtler forms of bias persist, particularly in sentiment and framing. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
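To make the idea of exclusion-based balancing concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each article has already been reduced to counts of female and male actor mentions (the `Article` fields, the `female_share` metric, and the greedy removal rule are all hypothetical choices made for illustration), and it drops one article at a time until the corpus-level share of female mentions is within a tolerance of a target.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical per-article summary; the real pipeline's features are not specified here.
    doc_id: str
    female_mentions: int
    male_mentions: int


def female_share(articles):
    """Share of female mentions among all gendered actor mentions (NaN if there are none)."""
    f = sum(a.female_mentions for a in articles)
    m = sum(a.male_mentions for a in articles)
    return f / (f + m) if (f + m) else math.nan


def balance_by_exclusion(articles, target=0.5, tolerance=0.05):
    """Greedily exclude the article whose removal moves the share closest to the target."""
    kept = list(articles)
    while len(kept) > 1:
        gap = abs(female_share(kept) - target)
        if gap <= tolerance:
            break
        best_idx, best_gap = None, gap
        for i in range(len(kept)):
            rest = kept[:i] + kept[i + 1:]
            candidate_gap = abs(female_share(rest) - target)
            if candidate_gap < best_gap:  # NaN never compares as smaller
                best_idx, best_gap = i, candidate_gap
        if best_idx is None:
            break  # no single exclusion improves the balance
        del kept[best_idx]
    return kept


corpus = [
    Article("a", 1, 9),
    Article("b", 5, 5),
    Article("c", 5, 6),
    Article("d", 2, 8),
]
balanced = balance_by_exclusion(corpus)
print([a.doc_id for a in balanced], round(female_share(balanced), 3))
```

A count-based share like this only captures surface-level balance. As the abstract notes, asymmetries in sentiment and framing can persist after such filtering, so a balanced mention count should not be read as evidence of an unbiased corpus.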