Towards Scalable Visual Data Wrangling via Direct Manipulation

📅 2025-12-20
🤖 AI Summary
Data wrangling remains a critical bottleneck in data science: existing tools rely either on error-prone manual scripting or on opaque black-box automation that offers limited control. This paper introduces Buckaroo, a scalable visual data wrangling system that treats data cleaning as direct manipulation over visualizations. Its contributions are fourfold: (1) users explore and repair missing values, outliers, and type mismatches by interacting directly with coordinated visualizations; (2) it supports user-defined error detectors and wranglers; (3) it combines differential storage with efficient indexing to localize anomaly detection and minimize recomputation; and (4) it tracks provenance for undo/redo and generates reproducible scripts. Integration with the Hopara pan-and-zoom engine keeps navigation interactive on large datasets. An empirical evaluation and an expert review indicate that Buckaroo bridges the gap between visual inspection and programmable repairs.

📝 Abstract
Data wrangling - the process of cleaning, transforming, and preparing data for analysis - is a well-known bottleneck in data science workflows. Existing tools either rely on manual scripting, which is error-prone and hard to debug, or automate cleaning through opaque black-box pipelines that offer limited control. We present Buckaroo, a scalable visual data wrangling system that restructures data preparation as a direct manipulation task over visualizations. Buckaroo enables users to explore and repair data anomalies - such as missing values, outliers, and type mismatches - by interacting directly with coordinated data visualizations. The system extensibly supports user-defined error detectors and wranglers, tracks provenance for undo/redo, and generates reproducible scripts for downstream tasks. Buckaroo maintains efficient indexing data structures and differential storage to localize anomaly detection and minimize recomputation. To demonstrate the applicability of our model, Buckaroo is integrated with the Hopara pan-and-zoom engine, which enables multi-layered navigation over large datasets without sacrificing interactivity. Through empirical evaluation and an expert review, we show that Buckaroo makes visual data wrangling scalable - bridging the gap between visual inspection and programmable repairs.
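
The abstract mentions provenance tracking for undo/redo and generation of reproducible scripts. The sketch below illustrates one way those two ideas can fit together. It is not Buckaroo's implementation: the names Step, History, drop_missing, and clip_above are hypothetical, and rows are assumed to be plain Python dictionaries. Each repair is recorded as a step that carries both a function and the line of code it exports to, so the exported script is simply the list of steps that have not been undone.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


@dataclass
class Step:
    """One recorded repair (hypothetical structure, for illustration only)."""
    name: str                      # human-readable label
    code: str                      # the line this step contributes to the script
    apply: Callable[[Rows], Rows]  # the same repair as a function


@dataclass
class History:
    """Provenance log: the original rows plus the repairs applied so far."""
    original: Rows
    done: List[Step] = field(default_factory=list)
    undone: List[Step] = field(default_factory=list)

    def do(self, step: Step) -> None:
        self.done.append(step)
        self.undone.clear()  # a new action invalidates the redo stack

    def undo(self) -> None:
        if self.done:
            self.undone.append(self.done.pop())

    def redo(self) -> None:
        if self.undone:
            self.done.append(self.undone.pop())

    def current(self) -> Rows:
        # Replay every active step on a copy, leaving the original untouched.
        rows = [dict(r) for r in self.original]
        for step in self.done:
            rows = step.apply(rows)
        return rows

    def script(self) -> str:
        # The exported script is the code of the active steps, in order.
        return "\n".join(step.code for step in self.done)


def drop_missing(column: str) -> Step:
    return Step(
        name=f"drop rows missing {column}",
        code=f"rows = [r for r in rows if r[{column!r}] is not None]",
        apply=lambda rows: [r for r in rows if r[column] is not None],
    )


def clip_above(column: str, limit: float) -> Step:
    return Step(
        name=f"clip {column} at {limit}",
        code=f"rows = [{{**r, {column!r}: min(r[{column!r}], {limit!r})}} for r in rows]",
        apply=lambda rows: [{**r, column: min(r[column], limit)} for r in rows],
    )


history = History([{"age": 34}, {"age": None}, {"age": 340}])
history.do(drop_missing("age"))
history.do(clip_above("age", 100))
print(history.current())
print(history.script())
```

Replaying steps from the original data keeps undo trivial at the cost of recomputation. A system aiming for interactivity on large datasets would more plausibly store differences between versions, which is what the abstract's mention of differential storage suggests.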
Problem

Research questions and friction points this paper is trying to address.

Cleans data through interactive visualizations instead of manual scripts or black-box automation
Detects and repairs anomalies like missing values and outliers
Scales wrangling for large datasets while maintaining interactivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct manipulation over visualizations for data wrangling
Extensible user-defined error detectors and wranglers
Efficient indexing and differential storage for scalability
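
To make the second and third points concrete, the following sketch shows a registry of user-defined detectors and an index that re-runs detection only on the rows a repair touched. It is an illustration under stated assumptions, not the paper's design: DETECTORS, detector, and AnomalyIndex are hypothetical names, and every detector is assumed to judge a single value in isolation. Detectors that depend on the whole distribution, such as statistical outlier tests, would need additional bookkeeping that is not shown here.

```python
from typing import Any, Callable, Dict, Iterable, List, Set

# Hypothetical registry: detector name -> function returning True for an anomaly.
DETECTORS: Dict[str, Callable[[Any], bool]] = {}


def detector(name: str):
    """Decorator that registers a user-defined, per-value detector."""
    def register(fn: Callable[[Any], bool]) -> Callable[[Any], bool]:
        DETECTORS[name] = fn
        return fn
    return register


@detector("missing")
def is_missing(value: Any) -> bool:
    return value is None or value == ""


@detector("type_mismatch")
def is_not_numeric(value: Any) -> bool:
    # Missing values are reported by the "missing" detector, not this one.
    return not is_missing(value) and not isinstance(value, (int, float))


class AnomalyIndex:
    """Tracks which row ids each detector currently flags for one column."""

    def __init__(self, rows: List[Dict[str, Any]], column: str) -> None:
        self.rows = rows
        self.column = column
        self.flags: Dict[str, Set[int]] = {}
        self.checked = 0  # number of detector calls, to show updates stay local
        self.recheck(range(len(rows)))  # one full pass when the data is loaded

    def recheck(self, row_ids: Iterable[int]) -> None:
        for i in row_ids:
            value = self.rows[i][self.column]
            for name, fn in DETECTORS.items():
                self.checked += 1
                flagged = self.flags.setdefault(name, set())
                if fn(value):
                    flagged.add(i)
                else:
                    flagged.discard(i)

    def repair(self, row_id: int, new_value: Any) -> None:
        # Apply the repair, then re-run detection on the edited row only.
        self.rows[row_id][self.column] = new_value
        self.recheck([row_id])


rows = [{"age": 34}, {"age": None}, {"age": "n/a"}, {"age": 29}]
index = AnomalyIndex(rows, "age")
index.repair(2, 41)
print(index.flags, index.checked)
```

After the initial pass, each repair costs one detector call per registered detector rather than a scan of the full column, which is the kind of localized recomputation the list above attributes to indexing and differential storage.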