taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades

📅 2025-06-03
🤖 AI Summary
This study addresses the longstanding scarcity of large-scale, temporally complete, open corpora for computational social science research on German, which has hindered empirical investigations into gender bias and language change. To bridge this gap, we construct the largest publicly available German newspaper corpus to date, comprising over 1.8 million articles from the daily newspaper *taz* (1980–2024). We introduce a novel, high-temporal-resolution, full-coverage news corpus construction paradigm, integrating OCR correction and metadata standardization. Furthermore, we design an extensible, structured analytical pipeline incorporating named entity recognition, sentiment analysis, and discourse co-occurrence modeling to jointly measure diachronic subject representation and bias. Empirical analysis reveals persistent male overrepresentation, followed by a significant convergence in gender representation after 2014. Both the corpus and the analytical framework are open-sourced, establishing critical infrastructure for fair, reproducible German NLP research.

📝 Abstract
Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.
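
As a rough illustration of what "gender representation over time" could mean in practice, the sketch below computes, per year, the share of male-coded terms among all gendered terms. It is not the authors' pipeline: the term lists, the function name `gender_share_by_year`, the `(year, text)` record layout, and the toy sentences are all assumptions made for this example, and a real analysis would need a much more careful lexicon or a named entity recogniser.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny term lists (assumption, not from the paper).
MALE_TERMS = {"mann", "männer", "herr", "vater", "sohn"}
FEMALE_TERMS = {"frau", "frauen", "mutter", "tochter"}

# \w+ is Unicode-aware in Python 3, so umlauts stay inside tokens.
TOKEN_RE = re.compile(r"\w+")


def gender_share_by_year(articles):
    """Return {year: share of male terms among all gendered terms}.

    `articles` is an iterable of (year, text) pairs; this layout is an
    assumption and says nothing about the corpus's actual format.
    """
    counts = defaultdict(lambda: [0, 0])  # year -> [male, female]
    for year, text in articles:
        for token in TOKEN_RE.findall(text.lower()):
            if token in MALE_TERMS:
                counts[year][0] += 1
            elif token in FEMALE_TERMS:
                counts[year][1] += 1
    # Years without any gendered term are skipped to avoid division by zero.
    return {
        year: male / (male + female)
        for year, (male, female) in sorted(counts.items())
        if male + female > 0
    }


# Toy records invented for illustration, not taken from the corpus.
toy_articles = [
    (1985, "Herr Schmidt und sein Sohn trafen einen Mann."),
    (1985, "Die Frau sprach."),
    (2020, "Die Frau und der Mann sprachen mit der Tochter."),
]
shares = gender_share_by_year(toy_articles)
```

A share above 0.5 for a given year would indicate that male-coded terms dominate under this crude measure; plotting the shares by year would show whether coverage moves toward balance.
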
Problem

Research questions and friction points this paper is trying to address.

Analyzing gender bias in German newspapers over decades
Addressing limited large-scale German language resources for NLP
Providing tools for studying actor mentions and sentiment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest publicly available German newspaper corpus to date, demonstrated on gender bias analysis
Scalable pipeline for actor mentions and sentiment analysis
Open-access resource for inclusive German-language NLP research
Stefanie Urchs
Faculty for Computer Science and Mathematics, Hochschule München University of Applied Sciences; Department of Statistics, LMU Munich
Veronika Thurner
Faculty for Computer Science and Mathematics, Hochschule München University of Applied Sciences
Matthias Assenmacher
Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML), LMU Munich
Christian Heumann
Professor of Statistics, Ludwig-Maximilians-Universität München
Stephanie Thiemichen
Faculty for Computer Science and Mathematics, Hochschule München University of Applied Sciences