Privacy-Enhanced Database Synthesis for Benchmark Publishing

📅 2024-05-02

🏛️ Proceedings of the VLDB Endowment

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing DBMS benchmarks rely on synthetic data because privacy constraints prevent the use of real user data. However, mainstream differential privacy (DP) data synthesis methods prioritize accuracy for aggregate queries or machine learning tasks and neglect fidelity in query runtime performance, a critical benchmarking metric. Method: We propose PrivBench, the first DP database synthesis framework explicitly optimized for query latency fidelity. It supports multi-relational, structured data with complex foreign-key dependencies. Its core components are sum-product network (SPN)-based probabilistic modeling, a privacy mechanism tailored to query execution characteristics, and a synthesis algorithm that preserves relational structure and statistical dependencies. Results: Under database-level differential privacy, PrivBench significantly improves fidelity in both distributional similarity and query latency similarity compared to state-of-the-art baselines, while remaining efficient and scalable across diverse workloads and schema complexities.

📝 Abstract

Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating synthesized databases for benchmarking that also prioritize privacy protection. Differential privacy (DP)-based data synthesis has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or downstream ML tasks, with less attention given to benchmarking factors like query runtime performance. This paper delves into differentially private database synthesis specifically for benchmark publishing scenarios, aiming to produce a synthetic database whose benchmarking factors closely resemble those of the original data. We introduce PrivBench, an innovative synthesis framework based on sum-product networks (SPNs) that supports the synthesis of high-quality benchmark databases which maintain fidelity in both data distribution and query runtime performance while preserving privacy. We validate that PrivBench can ensure database-level DP even when generating multi-relation databases with complex reference relationships. Our extensive experiments show that PrivBench efficiently synthesizes data that maintains privacy and excels in both data distribution similarity and query runtime similarity.
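
To make the idea of DP-based data synthesis concrete, here is a minimal sketch of the standard Laplace mechanism applied to a single-column histogram, followed by sampling synthetic values from the noisy counts. This is not PrivBench's SPN-based algorithm, which the abstract does not spell out. The function names, the toy column, and the add-or-remove-one-row neighboring definition (giving sensitivity 1) are all assumptions made for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, domain, epsilon, rng):
    """Return noisy, non-negative counts for every value in `domain`.

    Assumption: neighboring tables differ by adding or removing one row,
    so exactly one bin changes by 1 and the L1 sensitivity is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = Counter(values)
    scale = 1.0 / epsilon
    # Clamping at zero is post-processing and does not weaken the guarantee.
    return {v: max(0.0, counts.get(v, 0) + laplace_noise(scale, rng))
            for v in domain}


def sample_synthetic(noisy_counts, rng, n=None):
    """Draw synthetic values in proportion to the noisy counts."""
    domain = list(noisy_counts)
    weights = [noisy_counts[v] for v in domain]
    if n is None:
        # Derive the output size from the noisy total, not the true row count.
        n = round(sum(weights))
    if sum(weights) <= 0:
        weights = [1.0] * len(domain)  # fall back to uniform sampling
    return rng.choices(domain, weights=weights, k=n)


# Toy usage with a hypothetical "status" column.
rng = random.Random(0)
column = ["open"] * 60 + ["closed"] * 30 + ["pending"] * 10
noisy = dp_histogram(column, ["open", "closed", "pending"], epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy, rng)
```

A per-column histogram like this ignores correlations between columns and tables, which is part of the gap that a structured model such as an SPN is meant to address.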

Problem

Research questions and friction points this paper is trying to address.

Synthesizing privacy-protected databases for accurate benchmarking

Addressing privacy concerns in real-world data sharing for DBMS evaluation

Ensuring query runtime performance fidelity in differentially private databases
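
One hypothetical way to quantify query runtime fidelity is to run the same query against the original and the synthetic database and compare latencies. The sketch below does this with Python's built-in `sqlite3` module. The `orders` schema, the query, and the `relative_gap` metric are assumptions for illustration and are not taken from the paper's evaluation.

```python
import sqlite3
import time


def build_db(amounts):
    # Build a small in-memory database with a hypothetical `orders` table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                     [(a,) for a in amounts])
    conn.commit()
    return conn


def median_latency(conn, sql, repeats=5):
    # Time the query several times and take the median to dampen noise.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


def relative_gap(a, b):
    # 0.0 means identical latency; values near 1.0 mean very different latency.
    denom = max(a, b)
    return 0.0 if denom == 0 else abs(a - b) / denom


# Here the "synthetic" data is simply a copy, standing in for a real synthesizer.
original = build_db(range(1000))
synthetic = build_db(range(1000))
query = "SELECT COUNT(*) FROM orders WHERE amount > 500"
gap = relative_gap(median_latency(original, query),
                   median_latency(synthetic, query))
```

In practice such a comparison would be repeated over a whole query workload and run on the target DBMS, since latency depends on the engine, its indexes, and its optimizer.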

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses differential privacy for database synthesis

Builds on a sum-product network (SPN) framework

Ensures privacy and query runtime fidelity

🔎 Similar Papers

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models