Beyond Text-to-SQL: Can LLMs Really Debug Enterprise ETL SQL?

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that large language models (LLMs) struggle to generate correct enterprise-grade ETL SQL code in a single attempt, highlighting the urgent need for specialized evaluation benchmarks. To this end, we propose OurBench—the first benchmark tailored for enterprise SQL reasoning and debugging—which automatically injects realistic syntactic and semantic errors via reverse engineering and establishes an execution-free evaluation framework to enable efficient and scalable model assessment. We evaluate nearly 30 mainstream LLMs, revealing that even the best-performing model (Claude-4-Sonnet) achieves only 36.46% and 32.17% accuracy on syntactic and semantic error repair tasks, respectively, with most models scoring below 20%. These findings underscore the significant limitations of current LLMs in complex enterprise SQL debugging and establish a new paradigm and infrastructure for evaluating SQL debugging capabilities.
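The "execution-free evaluation" mentioned above can be sketched as a comparison of the repaired query against a reference answer without ever touching a database. The normalization and matching logic below is a hypothetical illustration, not the paper's actual metric:

```python
import re

def normalize(sql: str) -> str:
    """Crude canonicalization for illustration: trim, collapse
    whitespace, and lowercase so superficial formatting differences
    do not affect the comparison."""
    return re.sub(r"\s+", " ", sql.strip()).lower()

def execution_free_match(predicted: str, gold: str) -> bool:
    # An execution-free check judges the model's repaired SQL by
    # comparing its text/structure to the reference repair, instead
    # of executing both queries and comparing result sets. This avoids
    # needing live enterprise data warehouses during evaluation.
    return normalize(predicted) == normalize(gold)

# Formatting differences are ignored; real differences are not.
execution_free_match("SELECT *\n  FROM t", "select * from t")   # matches
execution_free_match("SELECT a FROM t", "SELECT b FROM t")      # does not
```

A production framework would likely compare normalized abstract syntax trees rather than strings, but the principle is the same: verdicts come from static comparison, which is what makes the assessment fast and resource-efficient at enterprise scale.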

📝 Abstract
SQL is central to enterprise data engineering, yet generating fully correct SQL code in a single attempt remains difficult, even for experienced developers and advanced text-to-SQL LLMs, often requiring multiple debugging iterations. We introduce OurBench, the first benchmark for enterprise-level SQL reasoning and debugging. Our benchmark is built on two key innovations: (1) an automated construction workflow that uses reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an execution-free evaluation framework tailored to enterprise settings, providing fast, accurate, and resource-efficient assessment. OurBench comprises 469 OurBenchSyn queries featuring syntax errors with explicit error messages, and 516 OurBenchSem queries targeting semantic errors in which the code fails to meet user intent. The queries are highly complex, averaging over 140 lines and featuring deep and wide abstract syntax trees. Evaluation of nearly 30 LLMs reveals a substantial performance gap: the best-performing model, Claude-4-Sonnet, achieves only 36.46 percent accuracy on OurBenchSyn and 32.17 percent on OurBenchSem, while most models score below 20 percent. We further explore four solution strategies, identify key challenges, and outline promising directions for enterprise SQL debugging with LLMs.
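The reverse-engineering workflow described in innovation (1) starts from known-correct SQL and injects a realistic bug, so the original query doubles as the ground-truth repair. The error taxonomy and function names below are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of benchmark construction by bug injection:
# each injector maps a correct query to a plausibly buggy variant.
ERROR_INJECTORS = {
    # drop the first comma in the SELECT list -> syntax error
    "missing_comma": lambda sql: sql.replace(",", "", 1),
    # fuse a keyword pair -> parser-visible typo
    "keyword_typo": lambda sql: sql.replace("GROUP BY", "GROUPBY", 1),
}

def inject_bug(correct_sql: str, error_type: str) -> dict:
    """Build one benchmark item: buggy SQL plus its reference fix."""
    buggy = ERROR_INJECTORS[error_type](correct_sql)
    return {
        "buggy_sql": buggy,
        "gold_sql": correct_sql,  # original query is the ground truth
        "error_type": error_type,
    }

item = inject_bug(
    "SELECT dept, COUNT(*) FROM emp GROUP BY dept",
    "keyword_typo",
)
# item["buggy_sql"] == "SELECT dept, COUNT(*) FROM emp GROUPBY dept"
```

Because the injection is automated, this scales to the 140-plus-line queries the benchmark targets; semantic errors (code that parses but violates user intent) would need injectors that preserve syntactic validity, e.g. swapping a join key or an aggregation column.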
Problem

Research questions and friction points this paper is trying to address.

SQL debugging
enterprise ETL
large language models
semantic errors
syntax errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

enterprise SQL debugging
automated benchmark construction
execution-free evaluation
LLM reasoning
realistic bug injection
Jing Ye
ByteDance Inc., Beijing, China
Yiwen Duan
ByteDance Inc., Beijing, China
Yonghong Yu
Nanjing University of Posts and Telecommunications
Victor Ma
ByteDance Inc., Beijing, China
Yang Gao
ByteDance Inc., Beijing, China
Xing Chen
ByteDance Inc., Beijing, China