ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

📅 2026-06-04

🤖 AI Summary
Existing text-to-SQL approaches struggle with enterprise databases because of schema heterogeneity, missing metadata, diverse SQL dialects, and complex analytical tasks. To address these challenges, this work proposes ProSPy, a framework that brings data-profiling-driven agentic reasoning to enterprise text-to-SQL. ProSPy automatically extracts fine-grained evidence by profiling the data, progressively prunes the schema to retain only task-relevant elements, fetches intermediate views through a dialect-agnostic SQL interface, and uses Python for flexible downstream analysis. This design reduces reliance on metadata and improves robustness across SQL dialects. With Claude-4.5-Opus and no majority voting, ProSPy reaches execution accuracies of 60.15% on Spider 2.0-Lite and 60.51% on Spider 2.0-Snow, outperforming strong baselines.
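To make the profiling idea concrete, here is a minimal sketch of column-level data profiling. It is not ProSPy's implementation: the function name `profile_table`, the `orders` table, and the choice of statistics (null count, distinct count, sample values) are assumptions made for illustration, and SQLite stands in for an enterprise warehouse.

```python
import sqlite3


def profile_table(conn, table, sample_size=3):
    """Hypothetical helper: gather per-column evidence from the data itself."""
    cur = conn.cursor()
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = [row[1] for row in cur.execute(f'PRAGMA table_info("{table}")').fetchall()]
    total = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {}
    for col in columns:
        nulls, distinct = cur.execute(
            f'SELECT SUM("{col}" IS NULL), COUNT(DISTINCT "{col}") FROM "{table}"'
        ).fetchone()
        samples = [
            r[0]
            for r in cur.execute(
                f'SELECT DISTINCT "{col}" FROM "{table}" '
                f'WHERE "{col}" IS NOT NULL ORDER BY "{col}" LIMIT ?',
                (sample_size,),
            ).fetchall()
        ]
        profile[col] = {
            "rows": total,
            "nulls": nulls or 0,  # SUM over an empty table is NULL
            "distinct": distinct,
            "samples": samples,
        }
    return profile


# Toy table with no descriptions or comments: the profile is the only evidence.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, status TEXT, amt REAL);
INSERT INTO orders VALUES (1, 'shipped', 10.0), (2, 'pending', NULL), (3, 'shipped', 5.5);
""")
profile = profile_table(conn, "orders")
```

Sample values such as 'pending' and 'shipped' tell a model what a column like `status` actually contains, which is the kind of evidence that can substitute for missing or unreliable metadata.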
📝 Abstract
Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL-Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, then progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51%, respectively, with Claude-4.5-Opus and without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.
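The last two stages (fetch an intermediate view with SQL, then analyze it in Python) can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's interface: the `sales` table, the `VIEW_SQL` query, and the `max_growth` helper are all hypothetical, and SQLite is used only so the example runs anywhere. The SQL is kept to plain aggregation, while the period-over-period comparison, which different dialects express with different window or date functions, is done in Python.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL);
INSERT INTO sales VALUES
  ('EU', '2024-01-15', 100.0), ('EU', '2024-02-10', 150.0),
  ('US', '2024-01-20', 200.0), ('US', '2024-02-05', 220.0);
""")

# Step 1 (SQL): reduce the raw table to a small intermediate view.
VIEW_SQL = """
SELECT region, SUBSTR(sale_date, 1, 7) AS month, SUM(amount) AS revenue
FROM sales
GROUP BY region, SUBSTR(sale_date, 1, 7)
ORDER BY region, month
"""
rows = conn.execute(VIEW_SQL).fetchall()


# Step 2 (Python): run the analysis that would be awkward in a single query.
def max_growth(rows):
    """Return (region, month, growth) for the largest month-over-month growth."""
    by_region = defaultdict(list)
    for region, month, revenue in rows:
        by_region[region].append((month, revenue))
    best = None
    for region, series in by_region.items():
        series.sort()  # 'YYYY-MM' strings sort chronologically
        for (_, prev), (month, cur) in zip(series, series[1:]):
            if prev:  # skip months with zero revenue to avoid division by zero
                growth = (cur - prev) / prev
                if best is None or growth > best[2]:
                    best = (region, month, growth)
    return best


result = max_growth(rows)
```

The split keeps the heavy lifting (scanning and aggregating rows) in the database and moves only a few aggregated rows into Python, which is the efficiency and flexibility trade-off the abstract describes.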
Problem

Research questions and friction points this paper is trying to address.

Text-to-SQL
enterprise databases
schema heterogeneity
SQL dialects
complex analytical queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Profiling-driven
SQL-Python Agentic Framework
Enterprise Text-to-SQL
Schema Pruning
Dialect-agnostic SQL