🤖 AI Summary
This work addresses the lack of interactive wrangling tools that let data journalists integrate multiple independently published datasets. The authors present OpenRoundup, an open-source, browser-based, client-only system that treats the collection of tables, rather than the single table, as the primary unit of work. It introduces “eager table consolidation” and a declarative vocabulary of two operations, Stack and Pack. The interface follows a schema-first, values-on-demand paradigm and provides live schema previews, ambient data quality alerts, and a recursive treemap visualization of the evolving operation tree. A replication study, in which the authors reproduce 17 published journalist programming workflows using only the interface, demonstrates expressive coverage of real-world consolidation tasks. A deployment study with four professional data journalists confirms the system’s utility for practitioners who understand joins conceptually but lack the programming skills to execute them, and surfaces a secondary value for data journalism education.
📝 Abstract
Data journalists routinely integrate records across multiple independently published sources to support accountability reporting, yet no existing interactive wrangling tool treats the collection of tables, rather than the single table, as its primary unit of work. We present OpenRoundup, an open-source, browser-based system that enables data journalists to consolidate multiple tables into a single analysis-ready output without writing code. The interface comprises five coordinated panels that implement a schema-first, values-on-demand paradigm with live schema previews, ambient data quality alerts, and a recursive treemap visualization of the evolving operation tree. A client-only architecture powered by DuckDB-WASM runs in the browser, providing strong data privacy guarantees suited to sensitive journalism data. The system introduces two conceptual contributions: eager table consolidation, in which a composite table is built early in the wrangling phase through interactive, incremental assembly of multiple source tables; and a declarative vocabulary for table consolidation consisting of two operations, Stack and Pack. We evaluate the system through a replication study, in which the authors reproduce 17 published journalist programming workflows using only the interface, and a deployment study with four professional data journalists. The replication study demonstrates expressive coverage of real-world consolidation tasks. The deployment study confirms utility for practitioners who understand joins conceptually but lack the programming skills to execute them, and surfaces an unanticipated secondary value for data journalism education.