🤖 AI Summary
In text-to-SQL, self-play fine-tuning methods (e.g., SPIN) suffer from limited improvement of the main model due to insufficient information gain and excessive generation of correct SQL by the opponent model. To address this, we propose SPFT-SQL, a novel self-play framework featuring: (1) verification-driven iterative data construction—leveraging SQL execution feedback to select high-quality synthetic examples; and (2) an error-directed loss mechanism—explicitly encouraging the opponent model to generate discriminative erroneous SQL, thereby enhancing the main model’s ability to detect and correct semantic-structural mismatches. Extensive experiments across six open-source large language models and five mainstream benchmarks demonstrate that SPFT-SQL consistently outperforms existing state-of-the-art methods, achieving average accuracy gains of 3.2–7.8 percentage points.
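The verification-driven selection step described above can be sketched with standard execution feedback: run each synthetic SQL against the database and keep only examples whose result set matches the gold query's. This is a minimal illustration using `sqlite3`; the function names and the triple layout `(question, candidate_sql, gold_sql)` are our own assumptions, not the paper's API.

```python
import sqlite3

def execution_matches(db_path, candidate_sql, gold_sql):
    """Return True if candidate_sql yields the same result set as gold_sql.

    Results are compared as multisets (order-insensitive), a common
    execution-feedback criterion in Text-to-SQL evaluation.
    """
    conn = sqlite3.connect(db_path)
    try:
        cand = conn.execute(candidate_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # candidates that fail to execute are discarded
    finally:
        conn.close()
    return sorted(map(repr, cand)) == sorted(map(repr, gold))

def select_verified_examples(db_path, synthetic_triples):
    """Keep only (question, candidate_sql, gold_sql) triples that pass execution feedback."""
    return [t for t in synthetic_triples
            if execution_matches(db_path, t[1], t[2])]
```

In an iterative loop, the surviving examples would be fed back into fine-tuning and the process repeated, yielding checkpoints of varying capability for the later self-play phase.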
📝 Abstract
Self-play fine-tuning (SPIN) can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, yet it still struggles on the Text-to-SQL task: self-play generates no new information, and the large number of correct SQL queries produced by the opponent model weakens the main model's ability to generate accurate SQL. To address these challenges, we propose SPFT-SQL, a self-play fine-tuning method tailored to Text-to-SQL. Before self-play, we introduce a verification-based iterative fine-tuning stage that synthesizes high-quality fine-tuning data from the database schema and validation feedback, improving model performance while building a model base of varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish correct SQL from the opponent's erroneous SQL and thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.
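To make the loss discussion concrete, here is a minimal numeric sketch of a SPIN/DPO-style logistic loss over sequence log-probabilities, plus one hypothetical "error-driven" variant: if the opponent's SQL happens to be correct, its pair carries no discriminative signal and is down-weighted. This is our reading of the idea stated in the abstract, not the paper's actual objective; all names and the `penalty` parameter are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spin_style_loss(lp_main_gold, lp_ref_gold, lp_main_opp, lp_ref_opp, beta=0.1):
    """SPIN/DPO-style loss on sequence log-probs: push the main model toward
    the gold SQL and away from the opponent model's SQL, relative to a
    frozen reference model."""
    margin = beta * ((lp_main_gold - lp_ref_gold) - (lp_main_opp - lp_ref_opp))
    return -math.log(sigmoid(margin))

def error_driven_loss(lp_main_gold, lp_ref_gold, lp_main_opp, lp_ref_opp,
                      opp_sql_is_correct, beta=0.1, penalty=1.0):
    """Hypothetical variant: pairs where the opponent's SQL executes to the
    gold result contribute nothing, so only erroneous opponent outputs
    drive the main model's update."""
    if opp_sql_is_correct:
        return 0.0
    return penalty * spin_style_loss(lp_main_gold, lp_ref_gold,
                                     lp_main_opp, lp_ref_opp, beta)
```

With equal log-probabilities everywhere the margin is zero and the loss is log 2; as the main model assigns more mass to the gold SQL than the opponent's, the loss decreases.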