CAPER: Clause-Aligned Process Supervision for Text-to-SQL

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of current Text-to-SQL supervision: query-level execution accuracy offers little fine-grained error localization, and token-level supervision suffers from semantic misalignment and high annotation cost. The authors propose clause-aligned process supervision built on the SQL abstract syntax tree, using counterfactual interventions to generate clause-level supervision signals automatically. These signals train CAPER-9B, a lightweight Clause-PRM reward model that gives clause-boundary feedback for policy optimization and candidate verification. On the BIRD and Spider benchmarks, the approach achieves up to a 15.3% relative improvement in execution accuracy (EX) over GPT-5.4 and, on held-out failures, reaches 84.53% failure-localization accuracy and 90.60% MRR.
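To make the two localization figures concrete, here is a minimal sketch of how top-1 localization accuracy and mean reciprocal rank (MRR) are commonly computed. It assumes that each failed query has one true faulty clause and that the model outputs a ranked list of suspect clauses. The function names and the toy data are hypothetical, and the paper's exact metric definitions may differ.

```python
def reciprocal_rank(ranked_clauses, true_clause):
    """Return 1/rank of the true faulty clause, or 0.0 if it was not ranked."""
    for rank, clause in enumerate(ranked_clauses, start=1):
        if clause == true_clause:
            return 1.0 / rank
    return 0.0


def localization_metrics(examples):
    """Compute (top-1 accuracy, MRR) over (ranked_clauses, true_clause) pairs."""
    if not examples:
        raise ValueError("need at least one example")
    n = len(examples)
    # Top-1 accuracy: the highest-ranked suspect is the true faulty clause.
    top1 = sum(1 for ranked, truth in examples if ranked and ranked[0] == truth)
    # MRR: average of 1/rank of the true faulty clause.
    rr_total = sum(reciprocal_rank(ranked, truth) for ranked, truth in examples)
    return top1 / n, rr_total / n


# Toy data (hypothetical): four failed queries with ranked suspect clauses.
examples = [
    (["WHERE", "SELECT", "FROM"], "WHERE"),              # true clause at rank 1
    (["SELECT", "WHERE", "FROM"], "WHERE"),              # rank 2
    (["FROM", "SELECT", "GROUP BY", "WHERE"], "WHERE"),  # rank 4
    (["ORDER BY", "WHERE"], "ORDER BY"),                 # rank 1
]
accuracy, mrr = localization_metrics(examples)
print(f"top-1 accuracy = {accuracy:.4f}, MRR = {mrr:.4f}")
```

MRR is always at least as high as top-1 accuracy because it gives partial credit when the true clause appears below rank 1, which is consistent with the reported 90.60% MRR exceeding the 84.53% accuracy.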

📝 Abstract

Text-to-SQL systems are typically evaluated by query-level execution correctness, but this terminal signal provides little guidance about which intermediate SQL decision caused success or failure. Token-level dense supervision is also ill-suited: SQL tokens do not align with complete semantic decisions, can penalize execution-equivalent queries, and are difficult to label reliably at scale. We therefore propose CAPER, which automatically derives clause-level supervision via counterfactual intervention on the SQL abstract syntax tree, enabling root-cause error localization for reward modeling; the resulting data is used to train CAPER-9B, a lightweight Clause-PRM that provides clause-boundary feedback for policy optimization and candidate verification. Experiments on BIRD and Spider show that clause-aligned supervision not only improves execution accuracy, achieving up to a 15.3% relative EX improvement over GPT-5.4, but also strengthens failure-localization capability, reaching 84.53% accuracy and 90.60% MRR on held-out failures. Our project page is at https://github.com/banrichard/RL-NL2SQL.
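The idea of deriving clause-level labels by counterfactual intervention can be illustrated with a small sketch. This is not the authors' implementation: it assumes a query is represented as a flat mapping from clause name to clause body (a stand-in for a real SQL AST), that a gold query is available, and that a clause counts as the root cause when swapping in the gold clause alone makes the execution result match. The helper names (`render`, `execute`, `clause_labels`) and the toy table are hypothetical.

```python
import sqlite3

# Clause order used to assemble a query; a simplified stand-in for an AST.
CLAUSE_ORDER = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "LIMIT"]


def render(clauses):
    """Assemble a SQL string from a {clause_name: body} mapping."""
    return " ".join(f"{name} {clauses[name]}" for name in CLAUSE_ORDER if name in clauses)


def execute(conn, sql):
    """Run a query and return its rows, or None if it fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def clause_labels(conn, predicted, gold):
    """Map each clause to True if replacing it alone with the gold clause
    makes the predicted query's result match the gold result."""
    gold_rows = execute(conn, render(gold))
    if execute(conn, render(predicted)) == gold_rows:
        return {name: False for name in predicted}  # nothing to localize
    labels = {}
    for name in CLAUSE_ORDER:
        if name not in predicted and name not in gold:
            continue
        patched = dict(predicted)
        if name in gold:
            patched[name] = gold[name]  # intervene on exactly one clause
        else:
            patched.pop(name)           # gold query has no such clause
        labels[name] = execute(conn, render(patched)) == gold_rows
    return labels


# Toy database and queries (hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ann', 'eng', 120), ('Bob', 'eng', 90), ('Cy', 'ops', 70);
""")
gold = {"SELECT": "name", "FROM": "employees", "WHERE": "salary > 100"}
predicted = {"SELECT": "name", "FROM": "employees", "WHERE": "salary > 80"}
labels = clause_labels(conn, predicted, gold)
print(labels)
```

In this toy case only the WHERE clause is flagged, since swapping it is the single intervention that restores the gold result. The sketch handles single-clause faults only; queries with several interacting errors would need interventions on clause subsets, and result comparison here is order-sensitive.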

Problem

Research questions and friction points this paper is trying to address.

Text-to-SQL

execution correctness

dense supervision

error localization

semantic decisions

Innovation

Methods, ideas, or system contributions that make the work stand out.

clause-aligned supervision

counterfactual intervention

Text-to-SQL