Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-SQL involves several reasoning-intensive challenges, including natural language understanding, schema awareness, and precise SQL generation, while the sparse rewards typical of reinforcement learning (RL) severely hinder model optimization. To address this, we propose a fine-grained partial-reward framework covering the entire SQL generation pipeline, built on four reward components: database schema linking, LLM self-feedback, n-gram similarity, and SQL syntax validation. Built on the Group Relative Policy Optimization (GRPO) framework, our approach lets the model internalize structured reasoning without relying on manually annotated reasoning traces. Evaluated on the BIRD benchmark, our RL-only trained 14B model outperforms o3-mini (+4%) and Gemini-1.5-Pro-002 (+3%), demonstrating substantial improvements in accuracy and generalization. These results empirically validate the effectiveness and scalability of reward-driven reasoning for Text-to-SQL.
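The summary names four partial-reward components but not how they are combined. A minimal sketch of one plausible combination is below; the weights, the Jaccard-style n-gram similarity, and the regex-based schema-linking score are illustrative assumptions, not the paper's exact formulation (the `ai_feedback` and `syntax_ok` inputs stand in for the LLM self-feedback and syntax-validation signals).

```python
import re

def ngram_similarity(pred, gold, n=2):
    """Jaccard similarity over token n-grams (one plausible instantiation)."""
    def grams(s):
        toks = s.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    a, b = grams(pred), grams(gold)
    return len(a & b) / len(a | b) if a | b else 0.0

def schema_linking_reward(pred_sql, gold_columns):
    """Fraction of gold schema items mentioned in the predicted SQL."""
    if not gold_columns:
        return 0.0
    hit = sum(bool(re.search(rf"\b{re.escape(c)}\b", pred_sql, re.I))
              for c in gold_columns)
    return hit / len(gold_columns)

def partial_reward(pred_sql, gold_sql, gold_columns, syntax_ok, ai_feedback,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four partial-reward components (hypothetical weights)."""
    components = (
        schema_linking_reward(pred_sql, gold_columns),  # schema linking
        ai_feedback,                                    # LLM self-feedback in [0, 1]
        ngram_similarity(pred_sql, gold_sql),           # n-gram similarity
        1.0 if syntax_ok else 0.0,                      # syntax validation
    )
    return sum(w * c for w, c in zip(weights, components))
```

Because each component is dense (it rewards a correct schema reference or valid syntax even when the full query is wrong), the combined signal mitigates the reward-sparsity problem the summary describes.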

📝 Abstract
Text-to-SQL is a challenging task involving multiple reasoning-intensive subtasks, including natural language understanding, database schema comprehension, and precise SQL query formulation. Existing approaches often rely on handcrafted reasoning paths with inductive biases that can limit their overall effectiveness. Motivated by the recent success of reasoning-enhanced models such as DeepSeek R1 and OpenAI o1, which effectively leverage reward-driven self-exploration to enhance reasoning capabilities and generalization, we propose a novel set of partial rewards tailored specifically for the Text-to-SQL task. Our reward set includes schema-linking, AI feedback, n-gram similarity, and syntax check, explicitly designed to address the reward sparsity issue prevalent in reinforcement learning (RL). Leveraging group relative policy optimization (GRPO), our approach explicitly encourages large language models (LLMs) to develop intrinsic reasoning skills necessary for accurate SQL query generation. With models of different sizes, we demonstrate that RL-only training with our proposed rewards consistently achieves higher accuracy and superior generalization compared to supervised fine-tuning (SFT). Remarkably, our RL-trained 14B-parameter model significantly outperforms larger proprietary models, e.g., o3-mini by 4% and Gemini-1.5-Pro-002 by 3% on the BIRD benchmark. These results highlight the efficacy of our proposed RL-training framework with partial rewards for enhancing both accuracy and reasoning capabilities in Text-to-SQL tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhances Text-to-SQL reasoning via tailored partial rewards
Addresses reward sparsity in reinforcement learning for SQL generation
Improves accuracy and generalization over supervised fine-tuning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tailored partial rewards for Text-to-SQL
Group relative policy optimization (GRPO)
Reinforcement learning enhances SQL accuracy
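The GRPO contribution listed above works by standardizing rewards within a group of sampled completions for the same prompt, so no learned value function is needed. A minimal sketch of the group-relative advantage computation (the epsilon term is a common numerical-stability assumption, not from this paper):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: standardize each sampled completion's
    reward against the mean and std of its own group, as in GRPO."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

For example, a group of rewards `[1.0, 0.0, 1.0, 0.0]` yields positive advantages for the two high-reward SQL samples and negative ones for the others, pushing the policy toward the better completions without a separate critic model.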