🤖 AI Summary
This work addresses three limitations of existing multi-candidate Text-to-SQL approaches that hinder performance on complex database schemas: redundant candidates, uniform repair strategies that ignore differences between error types, and single-perspective selection. The proposed framework, SIRIUS-SQL, integrates diverse candidate generation, execution-feedback-driven targeted repair, and a multi-perspective hybrid selection mechanism. Specifically, a difficulty-smoothing reinforcement learning recipe trains a specialized 32B model (SIRIUS-32B) to generate diverse executable SQL queries; each execution outcome (runtime error, timeout, or empty result) is classified and receives a targeted repair; and a confidence-gated selector combines execution-result agreement with pairwise SQL-form judgment, escalating only near-tied cases to a deterministic structural check. Pairing the specialist with a general-purpose large language model, the method reaches 75.88% execution accuracy on the BIRD development set and 91.20% on the SPIDER test set, and two of the three generalist pairings surpass Agentar-Scale-SQL, the strongest published multi-candidate system on BIRD dev.
📝 Abstract
Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out errors. Yet voting alone is not enough, because the multi-candidate recipe has three coupled weaknesses: 1) sampling more from a single generator produces increasingly redundant candidates, 2) existing pipelines apply one generic correction to every non-clean execution result, while runtime errors, timeouts, and empty results each indicate a different distance from correctness, and 3) existing selectors rely on a single angle such as result-majority voting or pairwise SQL comparison, missing what other angles would have caught. We present SIRIUS-SQL, which addresses all three weaknesses. A difficulty-smoothing RL recipe trains SIRIUS-32B to generate diverse executable SQL candidates, paired with a generalist LLM that fills in gaps left by the specialist. An execution-grounded lifecycle classifies each outcome and applies targeted repair before candidates re-enter the pool. A confidence-gated hybrid selector combines execution-result agreement with pairwise SQL-form judgment, escalating only near-tied cases to a deterministic structural check. SIRIUS-SQL reaches 75.88% on BIRD dev and 91.20% on SPIDER test. Two of three generalist pairings surpass Agentar-Scale-SQL, the strongest published multi-candidate system on BIRD dev.
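To make the outcome classification step concrete, the following is a minimal sketch of how a candidate's execution result might be sorted into the categories the abstract names (runtime error, timeout, empty result) before any targeted repair is chosen. It is not the authors' implementation: the names `Outcome` and `classify_execution`, the SQLite backend, and the 5-second default timeout are all assumptions made for illustration.

```python
import sqlite3
import time
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the execution outcomes named in the abstract."""
    OK = "ok"
    EMPTY = "empty_result"
    TIMEOUT = "timeout"
    RUNTIME_ERROR = "runtime_error"


def classify_execution(conn, sql, timeout_s=5.0):
    """Run one SQL candidate and return (Outcome, payload).

    payload is the fetched rows for OK/EMPTY, the error message for
    RUNTIME_ERROR, and None for TIMEOUT. The timeout value is an assumption.
    """
    deadline = time.monotonic() + timeout_s
    timed_out = False

    def check_deadline():
        # Called by SQLite every 1000 VM instructions; a non-zero
        # return value aborts the statement that is currently running.
        nonlocal timed_out
        if time.monotonic() > deadline:
            timed_out = True
            return 1
        return 0

    conn.set_progress_handler(check_deadline, 1000)
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        if timed_out:
            return Outcome.TIMEOUT, None
        return Outcome.RUNTIME_ERROR, str(exc)
    finally:
        # Remove the handler so later statements are not interrupted.
        conn.set_progress_handler(None, 0)
    return (Outcome.OK if rows else Outcome.EMPTY), rows
```

A repair stage could then dispatch on the returned label, for example passing the error message back to the generator for a runtime error while treating an empty result as a hint to re-check filter values. That dispatch policy is likewise an assumption, since the abstract does not spell out the individual repairs.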