Downsizing Diffusion Models for Cardinality Estimation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high storage overhead, elevated query latency, and insufficient accuracy of existing joint-distribution cardinality estimators, this paper introduces ADC and ADC+, the first approaches to employ lightweight diffusion models for this task. The key contributions are: (1) an efficient point-density estimator that evaluates log-likelihood by integrating a lightweight score-based approximation of the log-likelihood gradient; (2) range-query selectivity prediction that combines a Gaussian mixture model with importance-sampling Monte Carlo correction; and (3) automatic identification, via a decision tree, of high-confidence queries for which the correction step can be skipped, reducing both error and latency. On real-world benchmarks, ADC+ rivals Naru while using about 34% less storage, cutting per-query latency by 25%, and roughly doubling throughput; on a synthetic dataset with strong attribute correlations, it achieves a 10× reduction in 95th/99th percentile error over Naru.
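The confidence gate described in contribution (3) can be sketched as a simple rule. The function names and thresholds below are hypothetical stand-ins for ADC+'s trained decision tree, which the paper says learns to flag high-volume, high-selectivity queries where the GMM prediction alone suffices:

```python
def should_skip_correction(gmm_selectivity, query_volume,
                           sel_threshold=0.1, vol_threshold=0.2):
    # Stand-in for ADC+'s learned decision tree: high-volume,
    # high-selectivity queries are assumed safe for the GMM alone.
    # Thresholds here are illustrative, not from the paper.
    return gmm_selectivity >= sel_threshold and query_volume >= vol_threshold

def estimate_cardinality(gmm_sel, query_volume, mc_correct, n_rows):
    # Skip the Monte Carlo correction when the gate fires, so the
    # sampler cannot add variance to an already-accurate prediction.
    if should_skip_correction(gmm_sel, query_volume):
        sel = gmm_sel
    else:
        sel = mc_correct(gmm_sel)
    return sel * n_rows
```

The design point is that Monte Carlo correction is unbiased but noisy, so applying it to queries the GMM already predicts well can only hurt; gating it behind a cheap classifier trades a branch per query for lower tail error and latency.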

📝 Abstract
Inspired by the performance of score-based diffusion models in estimating complex text, video, and image distributions with thousands of dimensions, we introduce Accelerated Diffusion Cardest (ADC), the first joint distribution cardinality estimator based on a downsized diffusion model. To calculate the pointwise density value of data distributions, ADC's density estimator uses a formula that evaluates log-likelihood by integrating the score function, a gradient mapping that ADC learns to approximate efficiently with its lightweight score estimator. To answer range queries, ADC's selectivity estimator first predicts their selectivity using a Gaussian Mixture Model (GMM), then uses importance sampling Monte Carlo to correct its predictions with the more accurate pointwise density values produced by the density estimator. ADC+ further trains a decision tree to identify the high-volume, high-selectivity queries that the GMM alone can predict very accurately, in which case it skips the correction phase so that Monte Carlo does not add variance. Doing so lowers median Q-error and cuts per-query latency by 25 percent, making ADC+ usually twice as fast as Naru, arguably the state-of-the-art joint distribution cardinality estimator. Numerical experiments on well-established benchmarks show that on all real-world datasets tested, ADC+ rivals Naru and outperforms MSCN, DeepDB, LW-Tree, and LW-NN using around 66 percent of their storage space, and is at least 3 times as accurate as MSCN on 95th and 99th percentile error. Furthermore, on a synthetic dataset whose attributes exhibit complex, multilateral correlations, ADC and ADC+ remain considerably robust while almost every other learned model suffers significant accuracy declines. In this setting, ADC+ performs better than any other tested model, being 10 times as accurate as Naru on 95th and 99th percentile error.
Problem

Research questions and friction points this paper is trying to address.

Develop downsized diffusion models for cardinality estimation
Estimate joint data distribution using lightweight score approximation
Improve query accuracy and speed with selective Monte Carlo correction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Downsized diffusion model for cardinality estimation
Score function integration for density estimation
GMM with Monte Carlo correction for query selectivity
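The GMM-plus-Monte-Carlo idea above can be sketched as importance sampling: draw points from a cheap GMM proposal q, reweight each by p/q using the pointwise densities the density estimator provides, and count how many land in the query box. Everything below is a 1-D toy with illustrative names; the "true" density stands in for the density estimator's output and a single wide Gaussian stands in for the fitted GMM:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_pdf(x):
    # Toy "true" data density: a 1-D two-component Gaussian mixture,
    # standing in for the pointwise densities ADC's estimator returns.
    a = np.exp(-0.5 * (x + 1.0)**2) / np.sqrt(2 * np.pi)
    b = np.exp(-0.5 * ((x - 2.0) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))
    return 0.6 * a + 0.4 * b

def proposal_pdf(x):
    # Cheap surrogate q (stands in for the fitted GMM): one wide
    # Gaussian covering the data range.
    return np.exp(-0.5 * (x / 2.5)**2) / (2.5 * np.sqrt(2 * np.pi))

def range_selectivity(lo, hi, n=200_000):
    # Importance sampling: sample from q, reweight by p/q, and average
    # the weights of samples that fall inside the query range.
    xs = rng.normal(0.0, 2.5, size=n)
    w = true_pdf(xs) / proposal_pdf(xs)
    return float(np.mean(w * ((xs >= lo) & (xs <= hi))))
```

A good proposal keeps the weights p/q nearly constant, which is exactly what a GMM fitted to the data provides; that keeps the estimator's variance low even for narrow range predicates.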
Xinhe Mu
Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, Beijing, China
Zhaoqi Zhou
Huawei Technologies Co., Ltd., Beijing, China
Zaijiu Shang
Center for Mathematics and Interdisciplinary Sciences at Fudan University, Shanghai Institute for Mathematics and Interdisciplinary Sciences, Shanghai, China
Chuan Zhou
Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, Beijing, China
Gang Fu
Amazon
Machine Learning · Deep Learning · Semantic Network Analysis
Guiying Yan
Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, Beijing, China
Guoliang Li
Professor, Tsinghua University
Database · Big Data · Crowdsourcing · Data Cleaning & Integration
Zhiming Ma
Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, Beijing, China