SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current Text-to-SQL evaluation relies on static test databases, which produces false positives when semantically different SQL queries happen to yield identical execution results, thereby inflating accuracy estimates. To address this, we propose SpotIt, the first evaluation paradigm to integrate formal bounded equivalence verification into Text-to-SQL assessment. SpotIt actively synthesizes small counterexample databases that distinguish the generated and reference SQL queries, exposing semantic deviations masked by conventional result-set matching. We extend existing verifiers to support a richer subset of SQL (including joins, aggregations, and nested subqueries) by combining constraint solving with concrete instance generation for efficient equivalence checking. Evaluating ten state-of-the-art models on the BIRD benchmark, SpotIt reveals an average accuracy overestimation of 12.7%, demonstrating its effectiveness in assessing true semantic correctness.

📝 Abstract
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Text-to-SQL systems with formal verification
Detecting semantic differences between generated and ground-truth queries
Addressing limitations of optimistic test-based evaluation methods
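The optimism the bullets above describe can be made concrete with a minimal sketch (the `emp` table and queries are illustrative assumptions, not taken from the paper): two semantically different queries agree on a static test database, so result-set matching reports a false positive, yet a single extra row exposes the difference.

```python
import sqlite3

# Build a small static test database, as in test-based evaluation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, 50), (2, 70)])

# Ground truth: employees with salary strictly above 60.
gold = "SELECT id FROM emp WHERE salary > 60"
# Generated query with a different predicate (>=): semantically
# different, but no row in this database has salary exactly 60.
pred = "SELECT id FROM emp WHERE salary >= 60"

r_gold = conn.execute(gold).fetchall()
r_pred = conn.execute(pred).fetchall()
assert r_gold == r_pred  # test-based evaluation: false positive

# A database containing a salary of exactly 60 distinguishes them.
conn.execute("INSERT INTO emp VALUES (3, 60)")
assert conn.execute(gold).fetchall() != conn.execute(pred).fetchall()
```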
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses formal bounded equivalence verification engine
Extends existing verifiers for richer SQL subset
Actively searches databases to differentiate queries
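The active search in the bullets above can be sketched with brute-force enumeration standing in for a real bounded equivalence verifier; the schema, queries, domain, and row bound below are illustrative assumptions, not the paper's implementation, which unifies constraint solving with concrete instance generation.

```python
import sqlite3
from itertools import product

GOLD = "SELECT id FROM emp WHERE salary > 60"
PRED = "SELECT id FROM emp WHERE salary >= 60"

def run(query, rows):
    """Execute a query against a fresh in-memory database with the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (id INTEGER, salary INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    return sorted(conn.execute(query).fetchall())

def find_counterexample(domain=range(55, 66), max_rows=1):
    # Bounded search: try every database with up to max_rows rows whose
    # salaries come from a small domain, and return the first one on
    # which the two queries produce different results.
    for n in range(1, max_rows + 1):
        for salaries in product(domain, repeat=n):
            rows = [(i, s) for i, s in enumerate(salaries)]
            if run(GOLD, rows) != run(PRED, rows):
                return rows
    return None  # queries agree on every database within the bound

print(find_counterexample())  # a one-row database with salary = 60
```

A verifier replaces this enumeration with a solver query over a symbolic database, which is what makes the approach scale to joins, aggregations, and nested subqueries.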
Rocky Klopfenstein
Amherst College
Yang He
Simon Fraser University
Andrew Tremante
Amherst College
Yuepeng Wang
Simon Fraser University
Programming Languages, Program Synthesis, Program Verification, Databases
Nina Narodytska
VMware Research
Artificial Intelligence, Optimization
Haoze Wu
Amherst College, VMware Research by Broadcom