🤖 AI Summary
Natural language queries in Text-to-SQL often exhibit ambiguity, yielding multiple semantically valid SQL interpretations; conventional execution-based evaluation suffers from high false rejection rates due to its inability to recognize logically equivalent yet syntactically distinct SQL queries.
Method: This paper introduces the first LLM-based discriminative framework for *weak semantic equivalence* in SQL. We formally define weak SQL equivalence, design a schema-aware prompting paradigm and a structured evaluation pipeline, and empirically characterize LLMs’ capabilities and biases in SQL logical reasoning. Our approach integrates SQL parsing, canonicalization, few-shot prompting, and human-in-the-loop verification.
Contribution/Results: Evaluated on Spider and BIRD, our framework achieves 89.2% weak-equivalence discrimination accuracy—outperforming execution matching by 23.5%—and substantially reduces false rejections, establishing a more reliable and semantically grounded evaluation foundation for NL2SQL systems.
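To make the false-rejection problem concrete, here is a minimal sketch of naive execution-based matching, the baseline the framework improves on. The function name, toy schema, and queries are illustrative assumptions, not taken from the paper:

```python
import sqlite3

def execution_match(sql_a, sql_b, setup=None):
    """Naive execution-based check: run both queries on the same
    database and compare result multisets (order-insensitive).
    Hypothetical illustration of the conventional baseline."""
    conn = sqlite3.connect(":memory:")
    if setup:
        conn.executescript(setup)
    rows_a = conn.execute(sql_a).fetchall()
    rows_b = conn.execute(sql_b).fetchall()
    conn.close()
    return sorted(rows_a) == sorted(rows_b)

schema = """
CREATE TABLE emp(id INTEGER, name TEXT, dept TEXT, salary REAL);
INSERT INTO emp VALUES (1,'Ann','eng',100),(2,'Bob','eng',90),(3,'Cy','hr',80);
"""

# Syntactically distinct but logically equivalent predicates agree:
print(execution_match(
    "SELECT name FROM emp WHERE salary > 85",
    "SELECT name FROM emp WHERE NOT salary <= 85",
    setup=schema))  # True

# A column-order difference: arguably weakly equivalent (same
# information), but execution matching rejects it -- a false rejection.
print(execution_match(
    "SELECT name, salary FROM emp",
    "SELECT salary, name FROM emp",
    setup=schema))  # False
```

Comparing sorted row lists already tolerates row-order differences, yet superficial variations such as column order or extra projected columns still fail, which is one motivation for an LLM-based weak-equivalence judge.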
📝 Abstract
The rise of Large Language Models (LLMs) has significantly advanced Text-to-SQL (NL2SQL) systems, yet evaluating the semantic equivalence of generated SQL remains a challenge, especially given ambiguous user queries and multiple valid SQL interpretations. This paper explores using LLMs to assess both strict semantic equivalence and a more practical "weak" semantic equivalence. We analyze common patterns of SQL equivalence and inequivalence, and discuss the challenges of LLM-based evaluation.