Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science

📅 2025-07-09
🤖 AI Summary
This study addresses the lack of research on data preparation in data journalism, relative to data science, by investigating the challenges journalists face with dirty data. Drawing on interviews with 36 professional data journalists and extending an existing model of the data science workflow, we propose a novel taxonomy that characterizes dirty data issues as discrepancies between mental models, and we identify four challenges journalists face: diachronic, regional, fragmented, and disparate data sources. Using a hybrid thematic analysis that combines deductive and inductive coding, we synthesize 60 dirty data issues from 16 existing taxonomies and the interview data. The extended workflow model incorporates detailed data preparation activities, giving a practice-oriented account of data preparation in data journalism.

📝 Abstract
The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time-consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows, there has been little research on data preparation in data journalism. We address this gap with a hybrid form of thematic analysis that combines deductive codes derived from existing accounts of data science workflows and inductive codes arising from an interview study with 36 professional data journalists. We extend a previous model of data science work to incorporate detailed activities of data preparation. We synthesize 60 dirty data issues from 16 taxonomies on dirty data and our interview data, and we provide a novel taxonomy to characterize these dirty data issues as discrepancies between mental models. We also identify four challenges faced by journalists: diachronic, regional, fragmented, and disparate data sources.

Problem

Research questions and friction points this paper is trying to address.

Comparing data preparation in journalism vs data science
Addressing lack of research on data journalism workflows
Identifying dirty data issues and journalist challenges

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid thematic analysis combining deductive and inductive codes
Extended data science model with detailed preparation activities
Novel taxonomy for dirty data as mental model discrepancies
Stephen Kasica
The University of British Columbia, Vancouver, BC, Canada
Charles Berret
Linköping University, Norrköping, Sweden
Tamara Munzner
Professor of Computer Science, University of British Columbia
Visualization, information visualization, visual analytics