MLSkip: Data Skipping for ML Filters via Lightweight Metadata

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the fact that traditional data skipping techniques, designed for integer and string data, do not apply to database filters based on costly, black-box machine learning models. It initiates the study of data skipping for ML filters by connecting Parquet's native min-max statistics with two lines of research: a recently proposed query language for ML models and neural network verification. In preliminary results on ReLU architectures over tables from TPC-H and TPC-DS, min-max metadata alone yields an average pruning effectiveness of 27.4% for filters with selectivity below 0.1%. An enhanced metadata structure, a size-bounded 2D convex hull occupying at most 45 bytes per row group and column pair, raises the average pruning effectiveness to 38.31%. The authors report an end-to-end speedup of 1.07× over PyTorch in DuckDB.
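To illustrate why a 2D convex hull can prune more than min-max statistics, the following minimal sketch compares the two on a toy row group with correlated columns. It is an illustration under stated assumptions, not the paper's implementation: a linear score `a*x + b*y` stands in for the ML model, the hull is computed with the standard monotone chain algorithm, and no size bound is enforced. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def bounding_box(points: List[Point]) -> List[Point]:
    """Corners of the min-max box, i.e. what per-column statistics describe."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [(min(xs), min(ys)), (min(xs), max(ys)),
            (max(xs), min(ys)), (max(xs), max(ys))]


def max_linear(vertices: List[Point], a: float, b: float) -> float:
    """A linear function over a convex polygon attains its maximum at a vertex."""
    return max(a * x + b * y for x, y in vertices)


# Toy row group: the two columns are strongly correlated.
rows = [(0, 0), (1, 1), (2, 2), (3, 3), (1, 2), (2, 1)]
threshold = 2  # hypothetical filter: x - y > 2

box_bound = max_linear(bounding_box(rows), 1, -1)   # 3: cannot rule out a match
hull_bound = max_linear(convex_hull(rows), 1, -1)   # 1: no row can qualify

skip_with_box = box_bound <= threshold     # False
skip_with_hull = hull_bound <= threshold   # True
```

The box contains the corner (3, 0), which no actual row comes close to, so the min-max bound is loose. The hull excludes that corner and lets the row group be skipped.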
📝 Abstract
Database vendors recently released AI functions that can be used in filter predicates. As such functions often rely on costly, black-box ML models, they unveil new data management challenges. Concretely, traditional data skipping techniques for integer and string data fail to be applicable to the new filter type. Indeed, there is no known mechanism for pruning non-qualifying row groups, e.g., when reading files from blob storage. In this work, we initiate the study of data skipping techniques for ML filters. We make the case that Parquet's default min-max metadata is enough to enable pruning. To this end, we draw connections to two lines of research: (i) the recently proposed query language for ML models and (ii) neural network verification. Our preliminary results on ReLU architectures show that on tables from TPC-H and TPC-DS, the average pruning effectiveness for filters of selectivity below 0.1% amounts to 27.4%. Finally, inspired by research on spatial joins, we propose an enhanced metadata structure: a size-bounded 2D convex hull that verification tools can make better use of, increasing the pruning effectiveness to 38.31%, while occupying at most 45 bytes per row group and column pair. We observe an end-to-end speedup of 1.07× over PyTorch in DuckDB.
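The link between min-max metadata and neural network verification can be sketched with interval bound propagation, one of the simplest verification techniques for ReLU networks. The abstract does not say which verification tool the authors use, so the following is only an assumed stand-in: it pushes a row group's per-column `[min, max]` box through a small network and skips the row group when the resulting upper bound cannot exceed the filter threshold. The network weights, the filter form `model(x) > threshold`, and all function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, biases) of one affine layer


def affine_bounds(W: Matrix, b: Vector, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """Propagate the input box [lo, hi] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, x_lo, x_hi in zip(row, lo, hi):
            # A non-negative weight is minimised at x_lo, a negative one at x_hi.
            if w >= 0:
                l += w * x_lo
                h += w * x_hi
            else:
                l += w * x_hi
                h += w * x_lo
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def relu_bounds(lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def output_upper_bound(layers: List[Layer], col_min: Vector, col_max: Vector) -> float:
    """Sound (possibly loose) upper bound on a scalar-output ReLU network."""
    lo, hi = list(col_min), list(col_max)
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(W, b, lo, hi)
        if i < len(layers) - 1:  # no activation after the output layer
            lo, hi = relu_bounds(lo, hi)
    return hi[0]


def can_skip(layers: List[Layer], col_min: Vector, col_max: Vector, threshold: float) -> bool:
    """For the filter model(x) > threshold: skip if no row can possibly qualify."""
    return output_upper_bound(layers, col_min, col_max) <= threshold


# Hypothetical 2-2-1 network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]

# Row group A: both columns lie in [0, 1]; the output is at most 1.0.
skip_a = can_skip(layers, [0.0, 0.0], [1.0, 1.0], threshold=2.0)  # True
# Row group B: the first column ranges up to 4; the bound is 5.5, so it must be read.
skip_b = can_skip(layers, [0.0, 0.0], [4.0, 1.0], threshold=2.0)  # False
```

Interval bounds are sound but can be loose, so a row group that is not skipped may still contain no qualifying rows. Tighter verifiers or richer metadata reduce that gap at additional cost.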
Problem

Research questions and friction points this paper is trying to address.

data skipping
ML filters
metadata
query pruning
black-box ML models
Innovation

Methods, ideas, or system contributions that make the work stand out.

data skipping
ML filters
neural network verification
convex hull metadata
query optimization