🤖 AI Summary
This paper addresses the challenge of simultaneously achieving expressive provenance representation, efficient probabilistic inference, and scalability in probabilistic databases. We propose a general-purpose, provably correct probabilistic database system implemented as a PostgreSQL extension. Our system supports the full relational algebra, bag (multiset) semantics, and terminal aggregation, enabling fine-grained provenance tracking and exact probability computation for query results. To reconcile theoretical expressiveness with practical performance, we store generic provenance circuits in memory-mapped files. Furthermore, we design a unified data model and SQL-compatible query interface that integrates provenance and probabilistic reasoning. Experimental evaluation on multiple benchmarks demonstrates low overhead, strong horizontal and vertical scalability, and competitive performance against state-of-the-art specialized systems.
📝 Abstract
We present the data model, design choices, and performance of ProvSQL, a general and easy-to-deploy provenance tracking and probabilistic database system implemented as a PostgreSQL extension. ProvSQL's data and query models closely reflect those of a large core of SQL, including multiset semantics, the full relational algebra, and terminal aggregation. A key part of its implementation relies on generic provenance circuits stored in memory-mapped files. We propose benchmarks to measure the overhead of provenance and probabilistic evaluation, and we demonstrate ProvSQL's scalability and competitiveness with respect to other state-of-the-art systems.
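To make the notion of a generic provenance circuit concrete, here is a minimal Python sketch. It is not ProvSQL's implementation or API: the `Gate` class and the `evaluate`, `multiplicity`, and `probability` helpers are hypothetical names introduced only for illustration. The sketch assumes a circuit whose leaves are input tuples and whose internal gates are "plus" (alternative derivations) and "times" (joint use of tuples). The same circuit is evaluated once with integer counts, giving a multiset multiplicity, and once per possible world, giving an exact probability under the assumption that input tuples are independent. The probability routine enumerates all worlds by brute force, so it is exponential in the number of inputs and meant only to show the semantics.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Gate:
    """Hypothetical circuit node: an input tuple, or a plus/times gate."""
    kind: str              # "input", "plus", or "times"
    children: tuple = ()   # child gates (empty for inputs)
    name: str = ""         # tuple identifier (inputs only)


def evaluate(gate, leaf_value, plus, times, zero, one):
    """Evaluate the circuit bottom-up in the semiring (plus, times, zero, one)."""
    if gate.kind == "input":
        return leaf_value(gate.name)
    op, acc = (plus, zero) if gate.kind == "plus" else (times, one)
    for child in gate.children:
        acc = op(acc, evaluate(child, leaf_value, plus, times, zero, one))
    return acc


def multiplicity(gate, counts):
    """Counting semiring: how many times the result occurs (multiset semantics)."""
    return evaluate(gate, counts.__getitem__,
                    lambda a, b: a + b, lambda a, b: a * b, 0, 1)


def probability(gate, probs):
    """Exact probability by enumerating possible worlds (independent tuples assumed)."""
    names = sorted(probs)
    total = 0.0
    for world in product([False, True], repeat=len(names)):
        present = dict(zip(names, world))
        weight = 1.0
        for n in names:
            weight *= probs[n] if present[n] else 1.0 - probs[n]
        # Boolean semiring: is the result derivable in this world?
        if evaluate(gate, present.__getitem__,
                    lambda a, b: a or b, lambda a, b: a and b, False, True):
            total += weight
    return total


# Provenance of a result derivable as (t1 joined with t2) or (t1 joined with t3).
t1, t2, t3 = (Gate("input", name=n) for n in ("t1", "t2", "t3"))
circuit = Gate("plus", (Gate("times", (t1, t2)), Gate("times", (t1, t3))))

print(multiplicity(circuit, {"t1": 2, "t2": 1, "t3": 3}))       # 8
print(probability(circuit, {"t1": 0.5, "t2": 0.5, "t3": 0.5}))  # 0.375
```

Because `t1` is shared by both derivations, multiplying and adding probabilities gate by gate would give the wrong answer here; enumerating worlds sidesteps that at exponential cost, which is one reason practical systems need more careful probabilistic evaluation over the circuit.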