Is it Bigger than a Breadbox: Efficient Cardinality Estimation for Real World Workloads

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cardinality estimation for complex database queries suffers from significant error under traditional heuristic methods, while state-of-the-art learned estimators incur high operational complexity and deployment overhead. This paper proposes a lightweight online learning-based cardinality estimator: it identifies recurring subquery patterns via structural hashing of subquery graphs, then dynamically constructs and updates a localized regression model per pattern class, enabling low-overhead, high-accuracy real-time estimation. The key innovation is the combination of graph-structural hashing with online regression: it eliminates the need for offline training or auxiliary infrastructure, yielding a compact, plug-and-play model. Evaluated on the JOB-lite (IMDb) benchmark, the method reduces total query execution time by over 7.5 minutes, incurs only 37 seconds of online learning overhead, achieves accuracy competitive with SOTA learned estimators, and substantially outperforms conventional heuristic approaches.
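The structural hashing step described above can be sketched as follows. This is a minimal, hypothetical illustration (the paper's exact canonicalization is not specified here): the hash covers which relations are joined and which columns are filtered, while ignoring literal constants, so recurring query templates map to the same pattern bucket.

```python
import hashlib

def pattern_hash(tables, joins, filters):
    """Hash the *structure* of a subquery, not its literals.

    tables:  list of relation names
    joins:   list of join-predicate strings, e.g. "t.id=mc.movie_id"
    filters: list of (column, literal) pairs; the literal is dropped
             so that queries differing only in constants collide.
    (Hypothetical sketch; field names and canonical form are assumptions.)
    """
    canon = (
        tuple(sorted(tables)),
        tuple(sorted(joins)),
        tuple(sorted(col for col, _ in filters)),  # keep columns, drop values
    )
    return hashlib.sha256(repr(canon).encode()).hexdigest()

# Two queries that differ only in a literal share the same pattern:
h1 = pattern_hash(["title", "movie_companies"],
                  ["title.id=movie_companies.movie_id"],
                  [("title.production_year", 1990)])
h2 = pattern_hash(["title", "movie_companies"],
                  ["title.id=movie_companies.movie_id"],
                  [("title.production_year", 2005)])
assert h1 == h2
```

Because literals are excluded from the canonical form, the per-pattern model sees every instantiation of a template as one stream of training examples.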

📝 Abstract
DB engines produce efficient query execution plans by relying on cost models. Practical implementations estimate query cardinalities using heuristics, with magic numbers tuned to improve average performance on benchmarks. Empirically, estimation error grows significantly with query complexity. Alternatively, learning-based estimators offer improved accuracy, but add operational complexity that prevents their adoption in practice. Recognizing that query workloads contain highly repetitive subquery patterns, we learn many simple regressors online, each localized to a pattern. The regressor corresponding to a pattern can be randomly accessed using a hash of the subquery's graph structure. Our method has negligible overhead and competes with SoTA learning-based approaches on error metrics. Further, amending PostgreSQL with our method achieves notable accuracy and runtime improvements over traditional methods and drastically reduces operational costs compared to other learned cardinality estimators, thereby offering the most practical and efficient solution on the Pareto frontier. Concretely, simulating the JOB-lite workload on IMDb speeds up execution by 7.5 minutes (>30%) while incurring only 37 seconds of overhead for online learning.
Problem

Research questions and friction points this paper is trying to address.

Efficient cardinality estimation for database query workloads
Reducing estimation error growth with query complexity
Minimizing operational costs of learning-based cardinality estimators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns simple regressors online, one per repetitive subquery pattern
Uses a hash of the subquery's graph structure for random access to the matching regressor
Integrates into PostgreSQL with negligible operational overhead
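The per-pattern lookup described in the bullets above can be sketched as a hash table mapping each pattern key to a tiny online regressor, with a fallback to the engine's heuristic estimate for patterns seen for the first time. This is a minimal sketch under assumptions: the paper does not pin down the model family here, so a least-mean-squares linear regressor stands in, and all names (`estimate`, `observe`, the synthetic pattern key) are hypothetical.

```python
from collections import defaultdict

class OnlineRegressor:
    """Tiny least-mean-squares regressor; one instance per pattern."""
    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        # One SGD step on squared error against the observed cardinality.
        err = self.predict(x) - y
        self.b -= self.lr * err
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]

# pattern key (e.g. a structural hash) -> its localized regressor
models = defaultdict(lambda: OnlineRegressor(dim=1))

def estimate(pattern_key, features, heuristic_guess):
    """Use the learned model if this pattern was seen before,
    otherwise fall back to the engine's heuristic estimate."""
    m = models.get(pattern_key)          # .get() does not create an entry
    return m.predict(features) if m else heuristic_guess

def observe(pattern_key, features, true_cardinality):
    """After execution, feed the true cardinality back online."""
    models[pattern_key].update(features, true_cardinality)

# Demo on synthetic feedback: estimates for one pattern adapt over time.
for _ in range(300):
    for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
        observe("join#a1", [x], 2 * x + 1)   # synthetic ground truth
```

Keeping each model this small is what makes the reported 37-second learning overhead plausible: an estimate is a dictionary lookup plus a dot product, and an update is a single gradient step.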