🤖 AI Summary
To address the high storage overhead, elevated query latency, and insufficient accuracy of joint-distribution cardinality estimators, this paper introduces ADC and ADC+, the first approaches to employ lightweight diffusion models for this task. The key contributions are: (1) an efficient point-density estimator that evaluates log-likelihood by integrating a score function (the gradient of the log-density) approximated by a lightweight score estimator; (2) range-query selectivity prediction that starts from a Gaussian mixture model and corrects it with importance-sampling Monte Carlo; and (3) in ADC+, a decision tree that identifies the high-volume, high-selectivity queries the GMM already predicts accurately and skips the correction step for them, lowering median Q-error and cutting per-query latency by 25%. On the real-world datasets tested, ADC+ rivals Naru and outperforms MSCN, DeepDB, LW-Tree, and LW-NN while using about 66% of their storage space, and it is usually twice as fast as Naru. On a synthetic dataset with strong attribute correlations, ADC+ remains robust and is 10× as accurate as Naru at the 95th and 99th percentile error.
📝 Abstract
Inspired by the performance of score-based diffusion models in estimating complex text, video, and image distributions with thousands of dimensions, we introduce Accelerated Diffusion Cardest (ADC), the first joint distribution cardinality estimator based on a downsized diffusion model.
To calculate the pointwise density of a data distribution, ADC's density estimator evaluates log-likelihood by integrating the score function, a gradient mapping that ADC learns to approximate efficiently with its lightweight score estimator. To answer range queries, ADC's selectivity estimator first predicts their selectivity using a Gaussian Mixture Model (GMM), then uses importance-sampling Monte Carlo to correct its predictions with the more accurate pointwise density values calculated by the density estimator. ADC+ further trains a decision tree to identify the high-volume, high-selectivity queries that the GMM alone can predict very accurately; for these queries it skips the correction phase so that Monte Carlo does not add variance. Doing so lowers median Q-error and cuts per-query latency by 25 percent, making ADC+ usually twice as fast as Naru, arguably the state-of-the-art joint distribution cardinality estimator.
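The two-stage idea described above can be illustrated with a minimal Python sketch. It is not ADC's implementation, and every name in it is hypothetical. Three assumptions are made: the learned score estimator is replaced by the analytic score of a 2-D standard normal; the log-density is recovered by a straight-line path integral of the score from an anchor point whose density is assumed known, which may differ from the formula the paper uses; and the GMM serves only as the importance-sampling proposal. The ADC+ decision-tree gate is omitted.

```python
import math
import numpy as np

def log_density_from_score(score, x, x0, log_p0, steps=33):
    """log p(x) = log p(x0) + int_0^1 score(x0 + t*(x - x0)) . (x - x0) dt (trapezoid rule)."""
    t = np.linspace(0.0, 1.0, steps)                   # quadrature nodes, shape (steps,)
    delta = x - x0                                     # shape (n, d)
    path = x0 + t[:, None, None] * delta               # shape (steps, n, d)
    integrand = np.sum(score(path) * delta, axis=-1)   # shape (steps, n)
    dt = t[1] - t[0]
    integral = dt * (integrand.sum(axis=0) - 0.5 * (integrand[0] + integrand[-1]))
    return log_p0 + integral

def diag_gauss_pdf(x, mean, std):
    """Density of an axis-aligned Gaussian evaluated at the rows of x."""
    z = (x - mean) / std
    norm = np.prod(std) * (2.0 * math.pi) ** (x.shape[1] / 2.0)
    return np.exp(-0.5 * np.sum(z * z, axis=1)) / norm

def gmm_pdf(x, weights, means, stds):
    return sum(w * diag_gauss_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

def gmm_sample(rng, n, weights, means, stds):
    comp = rng.choice(len(weights), size=n, p=weights)  # pick a component per sample
    return means[comp] + stds[comp] * rng.standard_normal((n, means.shape[1]))

def is_selectivity(density, lo, hi, weights, means, stds, n=50_000, seed=0):
    """Importance-sampling estimate of the probability mass of the box [lo, hi]."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(rng, n, weights, means, stds)
    inside = np.all((x >= lo) & (x <= hi), axis=1)
    ratio = density(x) / gmm_pdf(x, weights, means, stds)  # importance weights p/q
    return float(np.mean(ratio * inside))

# Stand-in for the learned score estimator (assumption): score of N(0, I) is -x.
score = lambda pts: -pts

def density(x):
    # Anchor at the origin, where the toy density is known: log p(0) = -log(2*pi).
    return np.exp(log_density_from_score(score, x, np.zeros(2), -math.log(2.0 * math.pi)))

# Hypothetical two-component GMM proposal and a range query over both attributes.
weights = np.array([0.5, 0.5])
means = np.array([[-0.5, 0.5], [0.5, 1.0]])
stds = np.array([[1.2, 1.2], [1.2, 1.2]])
lo, hi = np.array([-1.0, 0.0]), np.array([1.0, 2.0])

estimate = is_selectivity(density, lo, hi, weights, means, stds)
# Closed-form mass of the box under N(0, I), available only because the toy target is Gaussian.
truth = math.erf(1 / math.sqrt(2)) * 0.5 * math.erf(2 / math.sqrt(2))
print(f"importance-sampling estimate: {estimate:.4f}, closed form: {truth:.4f}")
```

The proposal is deliberately wider than the target so that the importance weights stay bounded. A proposal with lighter tails than the target density would inflate the variance of the estimate, which is the kind of added variance the correction-skipping step in ADC+ is meant to avoid.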
Numerical experiments using well-established benchmarks show that on all real-world datasets tested, ADC+ rivals Naru and outperforms MSCN, DeepDB, LW-Tree, and LW-NN while using around 66 percent of their storage space, and is at least 3 times as accurate as MSCN on 95th and 99th percentile error. Furthermore, on a synthetic dataset where attributes exhibit complex, multilateral correlations, ADC and ADC+ remain robust, whereas almost every other learned model suffers a significant decline in accuracy. In this case, ADC+ performs better than any other tested model, being 10 times as accurate as Naru on 95th and 99th percentile error.
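For readers unfamiliar with the metric behind these figures, the following sketch computes Q-error and its percentiles, assuming the definition commonly used in the cardinality-estimation literature: the larger of estimate/actual and actual/estimate. The abstract does not spell out the definition, so the `floor` clamp and the sample numbers here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def q_error(estimated, actual, floor=1.0):
    """Q-error per query: max(est/act, act/est), with both clamped to `floor` to avoid division by zero."""
    est = np.maximum(np.asarray(estimated, dtype=float), floor)
    act = np.maximum(np.asarray(actual, dtype=float), floor)
    return np.maximum(est / act, act / est)

# Made-up cardinalities for four queries (illustrative only).
errors = q_error([100, 50, 8, 1000], [100, 25, 16, 10])

median_q = float(np.median(errors))
p95_q = float(np.percentile(errors, 95))
p99_q = float(np.percentile(errors, 99))
print(median_q, p95_q, p99_q)
```

Because Q-error is symmetric in over- and under-estimation and is always at least 1, the tail percentiles are dominated by the few worst queries, which is why the 95th and 99th percentiles are reported alongside the median.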