ProvSQL: A General System for Keeping Track of the Provenance and Probability of Data

📅 2025-04-16
🤖 AI Summary
This paper presents ProvSQL, a general-purpose provenance tracking and probabilistic database system implemented as a PostgreSQL extension. Its data and query models cover a large core of SQL: multiset (bag) semantics, the full relational algebra, and terminal aggregation. A key implementation choice is to store generic provenance circuits in memory-mapped files. Provenance tracking and probabilistic evaluation are exposed through SQL on top of PostgreSQL. The authors propose benchmarks that measure the overhead of provenance and probabilistic evaluation, and report that the system scales and is competitive with other state-of-the-art systems.

📝 Abstract
We present the data model, design choices, and performance of ProvSQL, a general and easy-to-deploy provenance tracking and probabilistic database system implemented as a PostgreSQL extension. ProvSQL's data and query models closely reflect those of a large core of SQL, including multiset semantics, the full relational algebra, and terminal aggregation. A key part of its implementation relies on generic provenance circuits stored in memory-mapped files. We propose benchmarks to measure the overhead of provenance and probabilistic evaluation and demonstrate its scalability and competitiveness with respect to other state-of-the-art systems.
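
To make the notion of a provenance circuit concrete, here is a minimal Python sketch. It is not ProvSQL's implementation or API. The names `Leaf`, `Gate`, `evaluate`, and `probability` are hypothetical, and the semiring-style reading of "plus" and "times" gates is an assumption borrowed from the general provenance literature rather than something stated in the abstract. The sketch evaluates one circuit in two ways: as a multiplicity under multiset semantics, and as a probability computed by brute-force enumeration of possible worlds, which is exponential in the number of input tuples and only meant for illustration.

```python
import operator
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    name: str  # identifier of an input tuple (hypothetical naming)


@dataclass(frozen=True)
class Gate:
    op: str  # "plus": alternative derivations; "times": joint use of inputs
    children: Tuple["Node", ...]


Node = Union[Leaf, Gate]


def evaluate(node: Node, leaf_value: Callable, plus, times, zero, one):
    """Evaluate the circuit bottom-up with the given operations."""
    if isinstance(node, Leaf):
        return leaf_value(node.name)
    combine, acc = (plus, zero) if node.op == "plus" else (times, one)
    for child in node.children:
        acc = combine(acc, evaluate(child, leaf_value, plus, times, zero, one))
    return acc


def probability(node: Node, probs: Dict[str, float]) -> float:
    """Probability that the circuit is true, assuming independent input
    tuples. Enumerates all possible worlds, so it is exponential."""
    names = sorted(probs)
    total = 0.0
    for world in product([False, True], repeat=len(names)):
        present = dict(zip(names, world))
        holds = evaluate(node, present.__getitem__,
                         operator.or_, operator.and_, False, True)
        if holds:
            weight = 1.0
            for n in names:
                weight *= probs[n] if present[n] else 1.0 - probs[n]
            total += weight
    return total


# An output tuple derived either from (r1 joined with s1) or (r2 joined with s1).
circuit = Gate("plus", (
    Gate("times", (Leaf("r1"), Leaf("s1"))),
    Gate("times", (Leaf("r2"), Leaf("s1"))),
))

# Multiset reading: input multiplicities combine into an output multiplicity.
multiplicity = evaluate(circuit, {"r1": 2, "r2": 1, "s1": 3}.__getitem__,
                        operator.add, operator.mul, 0, 1)

# Probabilistic reading: each input tuple is present independently.
prob = probability(circuit, {"r1": 0.5, "r2": 0.5, "s1": 0.8})
```

With these inputs the multiplicity is 2*3 + 1*3 = 9, and the probability is 0.8 * (1 - 0.5 * 0.5) = 0.6. The point of the sketch is that a single stored circuit can serve both provenance and probability queries.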
Problem

Research questions and friction points this paper is trying to address.

Track data provenance and probability efficiently
Implement a scalable PostgreSQL extension that supports a large core of SQL
Benchmark overhead of provenance and probabilistic evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

PostgreSQL extension for provenance tracking
Generic provenance circuits in memory-mapped files
Support for multiset semantics, the full relational algebra, and terminal aggregation
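
As a rough illustration of keeping circuit gates in a memory-mapped file, the following Python sketch writes fixed-size gate records to a file and reads them back through `mmap`. The record layout (`RECORD`), the gate codes, and the helper names are all hypothetical; the page does not describe ProvSQL's actual on-disk format, so this only shows the general technique.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record: gate kind (1 byte) and two child indices (uint32 each).
RECORD = struct.Struct("<BII")
LEAF, PLUS, TIMES = 0, 1, 2


def write_gates(path, gates):
    """Append each (kind, left, right) gate as one fixed-size record."""
    with open(path, "wb") as f:
        for kind, left, right in gates:
            f.write(RECORD.pack(kind, left, right))


def read_gate(mm, index):
    """Random access: a gate's offset is its index times the record size."""
    return RECORD.unpack_from(mm, index * RECORD.size)


def count_derivations(mm, index):
    """Count derivations of a gate, treating every leaf as a single tuple."""
    kind, left, right = read_gate(mm, index)
    if kind == LEAF:
        return 1
    lhs = count_derivations(mm, left)
    rhs = count_derivations(mm, right)
    return lhs + rhs if kind == PLUS else lhs * rhs


def demo():
    # Gates 0-2 are leaves; gate 5 = (g0 times g2) plus (g1 times g2).
    gates = [(LEAF, 0, 0), (LEAF, 0, 0), (LEAF, 0, 0),
             (TIMES, 0, 2), (TIMES, 1, 2), (PLUS, 3, 4)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "circuit.bin")
        write_gates(path, gates)
        with open(path, "rb") as f:
            # Map the whole file read-only; the OS pages data in on demand.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return read_gate(mm, 5), count_derivations(mm, 5)
```

Fixed-size records make lookup by gate index a constant-time offset computation, and the mapping lets several processes share one copy of the circuit through the operating system's page cache. Whether ProvSQL uses fixed-size records is not stated here.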