🤖 AI Summary
Provenance sketches built over a horizontally partitioned table often skip little data when the partitioning attributes are chosen poorly. Method: This paper proposes a cost-model-driven approach for automatic provenance sketch selection that balances sketch space overhead against skipping benefit. It introduces a sample-driven technique that estimates sketch size online, approximately, and at low cost, and it combines approximate query processing with fine-grained cost modeling to jointly optimize sketch structure and the choice of partitioning attribute. Contribution/Results: Experiments show the method achieves over 92% sketch-selection accuracy, reduces end-to-end query latency by up to 60%, and significantly outperforms fixed-partitioning and heuristic baselines, overcoming the performance bottlenecks of conventional static sketch deployment.
📝 Abstract
Provenance sketches, lightweight indexes that record what data is needed (i.e., relevant) for answering a query, can significantly improve the performance of important classes of queries (e.g., HAVING and top-k queries). Given a horizontal partition of a table, a provenance sketch for a query Q records which fragments contain provenance. Once a provenance sketch has been captured for a query, it can be used to speed up subsequent queries by skipping data that does not belong to the sketch. The size, and thus the effectiveness, of a provenance sketch is often quite sensitive to the choice of the attribute(s) used for partitioning. In this work, we develop sample-based techniques for estimating the size of provenance sketches, akin to a specialized form of approximate query processing. These techniques enable the online selection of provenance sketches by estimating the size of the sketch for each of a set of candidate attributes and then creating the sketch that is estimated to yield the largest benefit. We demonstrate experimentally that our estimation is accurate enough to select optimal or near-optimal provenance sketches in most cases, which in turn leads to a runtime improvement of up to 60% compared to other strategies for selecting provenance sketches.
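To make the idea concrete, the following is a minimal Python sketch of sample-based sketch selection for a top-k query over a range-partitioned table. It is a toy illustration under stated assumptions, not the estimator developed in the paper: the table, the candidate attributes `price` and `region`, the split points, and all function names are hypothetical, and the estimate simply captures the sketch on a sample and measures its coverage on that same sample.

```python
import random
from bisect import bisect_right

def fragment_of(value, boundaries):
    """Index of the range fragment containing value (boundaries are sorted split points)."""
    return bisect_right(boundaries, value)

def top_k_provenance(rows, k, order_by="price"):
    """Provenance of a top-k query: the k rows that make up the result."""
    return sorted(rows, key=lambda r: r[order_by], reverse=True)[:k]

def sketch(rows, attr, boundaries, k):
    """Provenance sketch: the set of fragments holding at least one provenance row."""
    return {fragment_of(r[attr], boundaries) for r in top_k_provenance(rows, k)}

def covered_fraction(rows, attr, boundaries, fragments):
    """Fraction of rows still read when fragments outside the sketch are skipped."""
    kept = sum(1 for r in rows if fragment_of(r[attr], boundaries) in fragments)
    return kept / len(rows)

def estimate_sketch_fraction(sample, attr, boundaries, k):
    """Naive estimate (an assumption): capture and measure the sketch on the sample only."""
    return covered_fraction(sample, attr, boundaries, sketch(sample, attr, boundaries, k))

def choose_attribute(sample, candidates, k):
    """Pick the candidate attribute whose estimated sketch covers the least data."""
    return min(candidates, key=lambda a: estimate_sketch_fraction(sample, a, candidates[a], k))

# Hypothetical table: price grows with the row id, region cycles through 0..9.
table = [{"price": i, "region": i % 10} for i in range(1000)]
# Hypothetical candidate attributes with their range split points.
candidates = {"price": [250, 500, 750], "region": [3, 5, 8]}

sample = random.Random(0).sample(table, 100)
best = choose_attribute(sample, candidates, k=5)
exact = covered_fraction(table, best, candidates[best],
                         sketch(table, best, candidates[best], k=5))
print(best, exact)
```

On the full toy table, partitioning on `price` places all top-5 provenance in one fragment covering a quarter of the rows, whereas partitioning on `region` spreads it over fragments covering half of the rows. That difference in sketch size is what the estimate is meant to detect before any sketch is actually created.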