Downsizing Diffusion Models for Cardinality Estimation

📅 2025-10-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high storage overhead, high query latency, and limited accuracy of joint-distribution cardinality estimators, this paper introduces ADC and ADC+, the first joint-distribution estimators built on a downsized diffusion model. The key contributions are: (1) a point-density estimator that evaluates log-likelihood by integrating the score function, the gradient of the log-density, which a lightweight score estimator learns to approximate; (2) a range-query selectivity estimator that first predicts selectivity with a Gaussian mixture model (GMM) and then corrects the prediction with importance-sampling Monte Carlo; and (3) in ADC+, a decision tree that identifies the high-volume, high-selectivity queries the GMM already predicts accurately and skips the correction step for them, which lowers median Q-error and cuts per-query latency by 25 percent. On the real-world datasets tested, ADC+ rivals Naru while usually running twice as fast, and it outperforms MSCN, DeepDB, LW-Tree, and LW-NN using around 66 percent of their storage space, being at least 3 times as accurate as MSCN on 95th and 99th percentile error. On a synthetic dataset with complex, multilateral attribute correlations, ADC+ is 10 times as accurate as Naru on 95th and 99th percentile error.
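
The percentile figures quoted for cardinality estimators are conventionally Q-errors, the factor by which an estimate is off in either direction. As a minimal sketch of how such numbers are computed, the snippet below assumes the common convention of clamping cardinalities to at least 1 and uses a nearest-rank percentile; the function names and the sample values are illustrative only and are not taken from the paper.

```python
import math


def q_error(estimate, truth):
    """Q-error: max(est/true, true/est), with both values clamped to >= 1 (assumed convention)."""
    est = max(float(estimate), 1.0)
    tru = max(float(truth), 1.0)
    return max(est / tru, tru / est)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Made-up estimated and true cardinalities for four queries.
estimates = [120, 80, 1000, 5]
truths = [100, 100, 100, 50]
errors = [q_error(e, t) for e, t in zip(estimates, truths)]

median_q_error = percentile(errors, 50)
p99_q_error = percentile(errors, 99)
```

A Q-error of 1 is a perfect estimate, and tail percentiles such as the 95th and 99th expose the worst-case queries that a median hides.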

📝 Abstract

Inspired by the performance of score-based diffusion models in estimating complex text, video, and image distributions with thousands of dimensions, we introduce Accelerated Diffusion Cardest (ADC), the first joint distribution cardinality estimator based on a downsized diffusion model. To calculate the pointwise density value of data distributions, ADC's density estimator uses a formula that evaluates log-likelihood by integrating the score function, a gradient mapping which ADC has learned to efficiently approximate using its lightweight score estimator. To answer range queries, ADC's selectivity estimator first predicts their selectivity using a Gaussian Mixture Model (GMM), then uses importance sampling Monte Carlo to correct its predictions with more accurate pointwise density values calculated by the density estimator. ADC+ further trains a decision tree to identify the high-volume, high-selectivity queries that the GMM alone can predict very accurately, in which case it skips the correction phase to prevent Monte Carlo from adding more variance. Doing so lowers median Q-error and cuts per-query latency by 25 percent, making ADC+ usually twice as fast as Naru, arguably the state-of-the-art joint distribution cardinality estimator. Numerical experiments using well-established benchmarks show that on all real-world datasets tested, ADC+ is capable of rivaling Naru and outperforming MSCN, DeepDB, LW-Tree, and LW-NN using around 66 percent of their storage space, being at least 3 times as accurate as MSCN on 95th and 99th percentile error. Furthermore, on a synthetic dataset where attributes exhibit complex, multilateral correlations, ADC and ADC+ remain robust while almost every other learned model suffers significant accuracy declines. In this case, ADC+ performs better than any other tested model, being 10 times as accurate as Naru on 95th and 99th percentile error.
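
The abstract does not spell out the formula the density estimator uses. As a rough illustration of the general idea of recovering a log-density by integrating a score function, the sketch below uses the simplest identity of this kind: the line integral of the score along a straight path from a reference point with known log-density. It is an assumption-laden toy, not the paper's method: an analytic diagonal Gaussian score stands in for the learned score estimator, and all names and values are hypothetical.

```python
import math


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a diagonal Gaussian.
    Stand-in for a learned score estimator."""
    return [-(xi - mi) / (si * si) for xi, mi, si in zip(x, mean, std)]


def gaussian_log_pdf(x, mean, std):
    """Exact log-density of the same Gaussian, used only to check the result."""
    return sum(
        -0.5 * ((xi - mi) / si) ** 2 - math.log(si * math.sqrt(2.0 * math.pi))
        for xi, mi, si in zip(x, mean, std)
    )


def log_density_by_score_integration(score, x, x0, log_p_x0, steps=200):
    """log p(x) = log p(x0) + int_0^1 score(x0 + t*(x - x0)) . (x - x0) dt,
    with the integral approximated by the composite trapezoid rule."""
    delta = [xi - x0i for xi, x0i in zip(x, x0)]

    def integrand(t):
        point = [x0i + t * di for x0i, di in zip(x0, delta)]
        return sum(si * di for si, di in zip(score(point), delta))

    h = 1.0 / steps
    total = 0.5 * (integrand(0.0) + integrand(1.0))
    total += sum(integrand(k * h) for k in range(1, steps))
    return log_p_x0 + h * total


# Hypothetical two-attribute example.
mean, std = [1.0, -2.0], [0.5, 2.0]
x0 = mean  # reference point where the log-density is assumed known
x = [1.8, 0.5]

log_p = log_density_by_score_integration(
    lambda p: gaussian_score(p, mean, std), x, x0, gaussian_log_pdf(x0, mean, std)
)
exact = gaussian_log_pdf(x, mean, std)
```

For this Gaussian the integrand is linear in t, so the trapezoid rule reproduces the exact log-density up to floating-point error; with a learned score the result would carry both approximation and quadrature error.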

Problem

Research questions and friction points this paper is trying to address.

Develop downsized diffusion models for cardinality estimation

Estimate joint data distribution using lightweight score approximation

Improve query accuracy and speed with selective Monte Carlo correction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Downsized diffusion model for cardinality estimation

Score function integration for density estimation

GMM with Monte Carlo correction for query selectivity
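
The GMM-plus-Monte-Carlo idea in the list above can be illustrated with a generic importance-sampling estimator: draw samples from the GMM, keep those inside the query range, and reweight them by the ratio of the more accurate density to the GMM density. This is a minimal one-dimensional sketch under stated assumptions, not the paper's estimator: a standard normal density stands in for the diffusion-based density estimator, and the mixture parameters and function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_pdf(x, components):
    """Density of a 1-D Gaussian mixture given (weight, mean, std) triples."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def gmm_sample(rng, components):
    """Draw one sample: pick a component by weight, then sample from it."""
    weights = [w for w, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, std)


def corrected_selectivity(density, lo, hi, components, n=50_000, seed=0):
    """Importance-sampling estimate of the integral of `density` over [lo, hi],
    using the GMM as the proposal distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(rng, components)
        if lo <= x <= hi:
            total += density(x) / gmm_pdf(x, components)
    return total / n


# Hypothetical proposal: two wide components, so the weights stay bounded.
proposal = [(0.5, -1.0, 1.5), (0.5, 1.0, 1.5)]
estimate = corrected_selectivity(normal_pdf, -1.0, 1.0, proposal)
exact = math.erf(1.0 / math.sqrt(2.0))  # true mass of N(0, 1) on [-1, 1]
```

The proposal is deliberately wider than the target so the importance weights remain bounded. When the proposal already matches the target well, the reweighting mostly contributes sampling noise, which is the situation in which ADC+ skips the correction step.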
