🤖 AI Summary
This work addresses a gap in agent-driven data analysis: an AI-generated SQL query may be semantically correct yet still violate the regulatory, privacy, or business constraints that govern how data may be combined and released. The paper proposes Data Flow Control (DFC), a framework for declaratively specifying and enforcing policies over tuple-level data flows within a DBMS query. DFC formalizes data safety as aggregate predicates over provenance monomials, with a policy language designed to be optimizer-invariant. Passant, a portable query rewriting layer, enforces DFC policies without materializing provenance. Across five engines (DuckDB, Umbra, PostgreSQL, DataFusion, and SQL Server), Passant incurs roughly 0% overhead and outperforms alternative approaches by orders of magnitude. The authors present this as a first step towards moving data safety from prompts and post-hoc checks into the data infrastructure.
📝 Abstract
Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem.
This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer-invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines -- DuckDB, Umbra, PostgreSQL, DataFusion, and SQL Server -- Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post-hoc checks into the data infrastructure. Data Flow Control is available as open source at https://github.com/dataflowcontrol/data-flow-control.
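To make "aggregate predicates over provenance monomials" concrete, the following is a minimal toy sketch in Python, not the paper's policy language or Passant's rewriting. It assumes a monomial is the set of input tuple ids that one joined row derives from, and it uses a hypothetical policy ("release a group only if it draws on at least two distinct patient tuples"). The table contents and all function names are illustrative assumptions. Unlike Passant, which the abstract says avoids materializing provenance, this sketch materializes it explicitly for readability.

```python
from collections import defaultdict

# Hypothetical toy tables: each row is (tuple_id, attributes).
patients = [
    ("p1", {"pid": 1, "zip": "02139"}),
    ("p2", {"pid": 2, "zip": "02139"}),
    ("p3", {"pid": 3, "zip": "94105"}),
]
visits = [
    ("v1", {"pid": 1, "cost": 120}),
    ("v2", {"pid": 2, "cost": 80}),
    ("v3", {"pid": 3, "cost": 300}),
    ("v4", {"pid": 1, "cost": 50}),
]


def join_with_provenance(left, right, key):
    """Equi-join; each result row carries a monomial, modeled here as the
    frozenset of input tuple ids it was derived from (an assumption)."""
    out = []
    for lid, lrow in left:
        for rid, rrow in right:
            if lrow[key] == rrow[key]:
                out.append((frozenset({lid, rid}), {**lrow, **rrow}))
    return out


def group_sum(rows, group_key, value_key):
    """SUM per group, collecting the monomials that feed each group."""
    groups = defaultdict(lambda: {"total": 0, "monomials": []})
    for monomial, row in rows:
        group = groups[row[group_key]]
        group["total"] += row[value_key]
        group["monomials"].append(monomial)
    return dict(groups)


def min_distinct_sources(monomials, prefix, k):
    """Hypothetical aggregate predicate over a group's monomials:
    COUNT(DISTINCT source tuple ids starting with `prefix`) >= k."""
    ids = {tid for m in monomials for tid in m if tid.startswith(prefix)}
    return len(ids) >= k


joined = join_with_provenance(patients, visits, "pid")
groups = group_sum(joined, "zip", "cost")

# Release only the groups whose data flow satisfies the policy.
released = {
    zip_code: g["total"]
    for zip_code, g in groups.items()
    if min_distinct_sources(g["monomials"], "p", 2)
}
print(released)
```

In this toy data, the group for zip code 02139 draws on two patient tuples and is released, while the group for 94105 draws on a single patient and is suppressed. The check depends only on which input tuples flow into each output, not on how the join is executed, which is the intuition behind wanting a policy definition that does not change with the optimizer's plan.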