🤖 AI Summary
Cloud data platforms struggle to reproduce performance regressions because tenants' raw data is inaccessible. This work proposes ScanTwin, a lightweight framework that targets the physical properties governing query-engine scan behavior, such as row-group pruning. It extracts per-row-group metadata (boundary values and compressed sizes) from Parquet file footers and releases it under ε-differential privacy to generate synthetic data. On the TPC-H and SSB benchmarks (6M rows), ScanTwin achieves 0% pruning error and less than 1% byte error with no privacy noise (ε=∞). Under ε=5, pruning error for high-selectivity queries (>30%) remains below 8.5% on both datasets, and per-query DuckDB scan timing closely tracks that on the original data.
📝 Abstract
In cloud data platforms, developers often encounter performance regressions that occur in specific tenant datasets. Because of confidentiality constraints, they cannot access the original data, which makes these regressions difficult to reproduce locally. Existing synthetic-data methods usually focus on statistical properties, such as matching data distributions or improving query accuracy, and overlook the physical properties that control how the engine behaves during scans, including row-group pruning.
We propose ScanTwin, a lightweight framework that extracts a per-row-group sketch from the Parquet footer, including boundary values and compressed sizes, and releases it under $\varepsilon$-differential privacy using a boundary parameterization. On TPC-H and SSB (6M rows), ScanTwin achieves 0% pruning error and less than 1% byte error at $\varepsilon{=}\infty$. Under $\varepsilon{=}5$, high-selectivity queries (selectivity $>$30%) incur below 8.5% pruning error on both datasets, and per-query scan timing on DuckDB closely tracks the original.
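To make the notions of row-group pruning and pruning error concrete, the following is a minimal Python sketch, not the ScanTwin implementation. It assumes a single numeric column, a range predicate, independent Laplace noise on each boundary, and a pruning-error metric defined as the fraction of row groups whose scan/skip decision changes. The abstract specifies neither the boundary parameterization nor the exact metric, so `RowGroupSketch`, `perturb`, `pruning_error`, and the `sensitivity` parameter are all hypothetical names chosen for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupSketch:
    min_val: float         # smallest column value in this row group
    max_val: float         # largest column value in this row group
    compressed_bytes: int  # compressed size of the column chunk


def scanned(sketches, lo, hi):
    """Indices of row groups whose [min, max] range overlaps the predicate [lo, hi]."""
    return [i for i, s in enumerate(sketches) if s.max_val >= lo and s.min_val <= hi]


def laplace(scale, rng):
    # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturb(sketches, epsilon, sensitivity, rng):
    """Toy mechanism (assumption): Laplace noise on every boundary.

    No privacy-budget accounting across boundaries is done here, so this
    is an illustration of the idea, not a calibrated DP release.
    """
    if math.isinf(epsilon):
        return list(sketches)  # epsilon = infinity means no noise
    scale = sensitivity / epsilon
    noisy = []
    for s in sketches:
        a = s.min_val + laplace(scale, rng)
        b = s.max_val + laplace(scale, rng)
        # Keep min <= max after perturbation.
        noisy.append(RowGroupSketch(min(a, b), max(a, b), s.compressed_bytes))
    return noisy


def pruning_error(orig, synth, lo, hi):
    """Assumed metric: fraction of row groups whose scan/skip decision differs."""
    a, b = set(scanned(orig, lo, hi)), set(scanned(synth, lo, hi))
    return len(a ^ b) / len(orig)


def bytes_scanned(sketches, lo, hi):
    """Total compressed bytes of the row groups that survive pruning."""
    return sum(sketches[i].compressed_bytes for i in scanned(sketches, lo, hi))


original = [RowGroupSketch(0, 9, 100), RowGroupSketch(10, 19, 120),
            RowGroupSketch(20, 29, 110), RowGroupSketch(30, 39, 90)]
rng = random.Random(0)
exact = perturb(original, math.inf, sensitivity=1.0, rng=rng)
noisy = perturb(original, 5.0, sensitivity=1.0, rng=rng)

print(scanned(original, 12, 25))               # [1, 2]
print(bytes_scanned(original, 12, 25))         # 230
print(pruning_error(original, exact, 12, 25))  # 0.0
print(pruning_error(original, noisy, 12, 25))  # depends on the noise draw
```

With exact boundaries the synthetic sketch prunes the same row groups as the original, which matches the intuition behind the 0% pruning error reported at $\varepsilon{=}\infty$. With noise, a row group is misclassified only when a perturbed boundary crosses a predicate endpoint.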