Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-SQL involves several reasoning-intensive challenges, including natural language understanding, schema awareness, and precise SQL generation, while the sparse rewards typical of reinforcement learning (RL) severely hinder model optimization. To address this, we propose a fine-grained partial-reward framework covering the entire SQL generation pipeline, built on four reward components: database schema linking, LLM self-feedback, n-gram similarity, and SQL syntax validation. Built on the Group Relative Policy Optimization (GRPO) framework, our approach lets the model internalize structured reasoning without relying on manually annotated reasoning traces. Evaluated on the BIRD benchmark, our RL-only trained 14B model outperforms o3-mini (+4%) and Gemini-1.5-Pro-002 (+3%), demonstrating substantial improvements in accuracy and generalization. These results empirically validate the effectiveness and scalability of reward-driven reasoning for Text-to-SQL.
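The summary names four partial-reward components but not how they are combined. A minimal sketch of one plausible combination is below; the weights, the Jaccard-style n-gram similarity, and the regex-based schema-linking score are illustrative assumptions, not the paper's exact formulation (the `ai_feedback` and `syntax_ok` inputs stand in for the LLM self-feedback and syntax-validation signals).

```python
import re

def ngram_similarity(pred, gold, n=2):
    """Jaccard similarity over token n-grams (one plausible instantiation)."""
    def grams(s):
        toks = s.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    a, b = grams(pred), grams(gold)
    return len(a & b) / len(a | b) if a | b else 0.0

def schema_linking_reward(pred_sql, gold_columns):
    """Fraction of gold schema items mentioned in the predicted SQL."""
    if not gold_columns:
        return 0.0
    hit = sum(bool(re.search(rf"\b{re.escape(c)}\b", pred_sql, re.I))
              for c in gold_columns)
    return hit / len(gold_columns)

def partial_reward(pred_sql, gold_sql, gold_columns, syntax_ok, ai_feedback,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four partial-reward components (hypothetical weights)."""
    components = (
        schema_linking_reward(pred_sql, gold_columns),  # schema linking
        ai_feedback,                                    # LLM self-feedback in [0, 1]
        ngram_similarity(pred_sql, gold_sql),           # n-gram similarity
        1.0 if syntax_ok else 0.0,                      # syntax validation
    )
    return sum(w * c for w, c in zip(weights, components))
```

Because each component is dense (it rewards a correct schema reference or valid syntax even when the full query is wrong), the combined signal mitigates the reward-sparsity problem the summary describes.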

📝 Abstract
Text-to-SQL is a challenging task involving multiple reasoning-intensive subtasks, including natural language understanding, database schema comprehension, and precise SQL query formulation. Existing approaches often rely on handcrafted reasoning paths with inductive biases that can limit their overall effectiveness. Motivated by the recent success of reasoning-enhanced models such as DeepSeek R1 and OpenAI o1, which effectively leverage reward-driven self-exploration to enhance reasoning capabilities and generalization, we propose a novel set of partial rewards tailored specifically for the Text-to-SQL task. Our reward set includes schema-linking, AI feedback, n-gram similarity, and syntax check, explicitly designed to address the reward sparsity issue prevalent in reinforcement learning (RL). Leveraging group relative policy optimization (GRPO), our approach explicitly encourages large language models (LLMs) to develop intrinsic reasoning skills necessary for accurate SQL query generation. With models of different sizes, we demonstrate that RL-only training with our proposed rewards consistently achieves higher accuracy and superior generalization compared to supervised fine-tuning (SFT). Remarkably, our RL-trained 14B-parameter model significantly outperforms larger proprietary models, e.g., o3-mini by 4% and Gemini-1.5-Pro-002 by 3% on the BIRD benchmark. These results highlight the efficacy of our proposed RL-training framework with partial rewards for enhancing both accuracy and reasoning capabilities in Text-to-SQL tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhances Text-to-SQL reasoning via tailored partial rewards
Addresses reward sparsity in reinforcement learning for SQL generation
Improves accuracy and generalization over supervised fine-tuning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tailored partial rewards for Text-to-SQL
Group relative policy optimization (GRPO)
Reinforcement learning enhances SQL accuracy
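The GRPO contribution listed above works by standardizing rewards within a group of sampled completions for the same prompt, so no learned value function is needed. A minimal sketch of the group-relative advantage computation (the epsilon term is a common numerical-stability assumption, not from this paper):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: standardize each sampled completion's
    reward against the mean and std of its own group, as in GRPO."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

For example, a group of rewards `[1.0, 0.0, 1.0, 0.0]` yields positive advantages for the two high-reward SQL samples and negative ones for the others, pushing the policy toward the better completions without a separate critic model.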