Cost-based Selection of Provenance Sketches for Data Skipping

📅 2025-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Data skipping with provenance sketches over horizontally partitioned tables is often inefficient when the partitioning attributes are chosen poorly. Method: This paper proposes a cost-model-driven approach for automatic provenance sketch selection that weighs a sketch's size against its skipping benefit. It introduces a sample-based technique that estimates sketch sizes online, approximately, and at low cost, and it combines this form of approximate query processing with a cost model to choose which partitioning attribute(s) to build the sketch on. Contribution/Results: Experiments show the method achieves over 92% sketch-selection accuracy, reduces end-to-end query latency by up to 60%, and significantly outperforms fixed-partitioning and heuristic baselines, avoiding the performance bottlenecks of static sketch deployment.

📝 Abstract

Provenance sketches, lightweight indexes that record what data is needed (is relevant) for answering a query, can significantly improve the performance of important classes of queries (e.g., HAVING and top-k queries). Given a horizontal partition of a table, a provenance sketch for a query Q records which fragments contain provenance. Once a provenance sketch has been captured for a query, it can be used to speed up subsequent queries by skipping data that does not belong to the sketch. The size, and thus also the effectiveness, of a provenance sketch is often quite sensitive to the choice of attribute(s) we are partitioning on. In this work, we develop sample-based estimation techniques for the size of provenance sketches, akin to a specialized form of approximate query processing. This technique enables the online selection of provenance sketches by estimating the size of sketches for a set of candidate attributes and then creating the sketch that is estimated to yield the largest benefit. We demonstrate experimentally that our estimation is accurate enough to select optimal or near-optimal provenance sketches in most cases, which in turn leads to a runtime improvement of up to 60% compared to other strategies for selecting provenance sketches.
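
To make the idea of a sketch concrete, here is a minimal Python sketch, not the paper's implementation. It assumes range partitioning, a toy top-2 query on a hypothetical `price` column, and that the provenance of the query is simply the rows it returns. The helper names (`fragment_of`, `capture_sketch`, `apply_sketch`) are invented for illustration.

```python
from bisect import bisect_right

def fragment_of(value, boundaries):
    """Index of the range fragment holding `value`; `boundaries` are sorted split points."""
    return bisect_right(boundaries, value)

def capture_sketch(provenance, attr, boundaries):
    """A sketch is the set of fragments that contain at least one provenance row."""
    return {fragment_of(r[attr], boundaries) for r in provenance}

def apply_sketch(rows, attr, boundaries, sketch):
    """Data skipping: keep only rows whose fragment belongs to the sketch."""
    return [r for r in rows if fragment_of(r[attr], boundaries) in sketch]

def top_k(rows, k):
    """Toy query (an assumption for this example): the k rows with the highest price."""
    return sorted(rows, key=lambda r: r["price"], reverse=True)[:k]

prices = [3, 12, 25, 37, 8, 33, 18, 29]
rows = [{"id": i, "price": p} for i, p in enumerate(prices)]

provenance = top_k(rows, 2)  # the rows with price 37 and 33

# The same provenance yields sketches of different sizes per partitioning attribute.
price_sketch = capture_sketch(provenance, "price", [10, 20, 30])  # one fragment
id_sketch = capture_sketch(provenance, "id", [2, 4, 6])           # two fragments

# Re-running the query on the skipped data returns the same answer.
filtered = apply_sketch(rows, "price", [10, 20, 30], price_sketch)
assert top_k(filtered, 2) == top_k(rows, 2)
```

Partitioning on `price` covers the provenance with one of four fragments, while partitioning on `id` needs two, which is the sensitivity to attribute choice that the abstract describes.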

Problem

Research questions and friction points this paper is trying to address.

Select optimal attributes for partitioning to minimize provenance sketch size

Estimate sketch size efficiently using sample-based techniques for query speed-up

Improve query runtime by up to 60% via cost-based sketch selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Provenance sketches skip irrelevant data fragments

Sample-based estimation for sketch size selection

Online selection optimizes sketch effectiveness
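
The three points above can be illustrated with a deliberately naive Python sketch. It is a stand-in, not the paper's estimator or cost model: it runs the query on a uniform random sample, counts the fragments the sample's provenance touches for each candidate attribute, and picks the attribute with the smallest estimated fraction of fragments. The function names, the synthetic `price` and `noise` columns, and the use of "fraction of fragments covered" as a cost proxy (which implicitly assumes equal-size fragments) are all assumptions made for this example.

```python
import random
from bisect import bisect_right

def estimated_sketch_fraction(sample, query_provenance, attr, boundaries):
    """Fraction of range fragments touched by the query's provenance on the sample."""
    provenance = query_provenance(sample)
    fragments = {bisect_right(boundaries, r[attr]) for r in provenance}
    return len(fragments) / (len(boundaries) + 1)

def select_attribute(rows, query_provenance, candidates, sample_rate, seed=0):
    """Pick the candidate attribute whose estimated sketch covers the fewest fragments."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_rate))
    sample = rng.sample(rows, sample_size)  # uniform sample without replacement
    estimates = {
        attr: estimated_sketch_fraction(sample, query_provenance, attr, boundaries)
        for attr, boundaries in candidates.items()
    }
    best = min(estimates, key=estimates.get)
    return best, estimates

def top10_by_price(rs):
    """Toy query: its provenance is taken to be the 10 highest-priced rows."""
    return sorted(rs, key=lambda r: r["price"], reverse=True)[:10]

# Synthetic table: `noise` is a permutation of `price`, so it is unrelated to the ranking.
rows = [{"price": i, "noise": (i * 37) % 1000} for i in range(1000)]

# Both candidates use ten equal-width range fragments.
candidates = {
    "price": list(range(100, 1000, 100)),
    "noise": list(range(100, 1000, 100)),
}

best, estimates = select_attribute(rows, top10_by_price, candidates, sample_rate=0.2)
print(best, estimates)
```

A top-k query evaluated on a sample does not return the true top-k rows, so this plain estimate can be off; that gap is the reason a specialized estimation technique is needed rather than running the query on a sample as-is.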
