LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In aviation MRO text-to-SQL, coarse-grained evaluation metrics (e.g., binary execution accuracy) and the scarcity of high-quality annotated data jointly hinder progress. To address this, we propose: (1) an F1-based soft evaluation metric that quantifies SQL semantic correctness via information overlap, enabling fine-grained, attributable assessment; and (2) a schema-driven LLM synthesis framework that leverages database-structure-aware prompting and execution-result semantic alignment to generate high-fidelity question-SQL pairs. Evaluated on a real-world aviation MRO database, our soft metric significantly improves error localization. The synthesized data constitutes the first domain-specific text-to-SQL benchmark for aviation MRO, demonstrating superior reliability and validity over conventional evaluation paradigms.
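The paper's exact formulation of the soft metric is not given on this page, but a minimal sketch of an F1-style information-overlap score over executed result sets might look like the following. The function name, the row-level granularity, and the multiset treatment are all assumptions for illustration, not the authors' definition:

```python
from collections import Counter

def soft_f1(predicted_rows, gold_rows):
    """Row-level F1 between two SQL result sets, treated as multisets.

    predicted_rows / gold_rows: iterables of tuples, e.g. the output of
    cursor.fetchall(). Returns a score in [0, 1]: 1.0 for identical
    results, 0.0 for no overlapping rows, and a graded value in between,
    unlike binary execution accuracy.
    """
    pred = Counter(predicted_rows)
    gold = Counter(gold_rows)
    if not pred and not gold:
        return 1.0  # both queries return empty results: treat as a match
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```

A partially correct query, e.g. one that retrieves half of the expected rows plus one spurious row, would score around 0.5 here rather than a flat 0, which is the kind of fine-grained, attributable feedback the summary describes.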

📝 Abstract
The application of Large Language Models (LLMs) to text-to-SQL tasks promises to democratize data access, particularly in critical industries like aviation Maintenance, Repair, and Operation (MRO). However, progress is hindered by two key challenges: the rigidity of conventional evaluation metrics such as execution accuracy, which offer coarse, binary feedback, and the scarcity of domain-specific evaluation datasets. This paper addresses these gaps. To enable more nuanced assessment, we introduce a novel F1-score-based 'soft' metric that quantifies the informational overlap between generated and ground-truth SQL results. To address data scarcity, we propose an LLM-driven pipeline that synthesizes realistic question-SQL pairs from database schemas. We demonstrate our contributions through an empirical evaluation on an authentic MRO database. Our experiments show that the proposed soft metric provides more insightful performance analysis than strict accuracy, and our data generation technique is effective in creating a domain-specific benchmark. Together, these contributions offer a robust framework for evaluating and advancing text-to-SQL systems in specialized environments.
Problem

Research questions and friction points this paper is trying to address.

Rigid evaluation metrics limit text-to-SQL performance analysis
Lack of domain-specific datasets hinders aviation MRO applications
Need for nuanced SQL comparison beyond binary accuracy metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel F1-score-based soft SQL evaluation metric
LLM-driven pipeline for question-SQL pair synthesis
Domain-specific benchmark creation for aviation MRO
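The synthesis pipeline itself is not detailed on this page, but its schema-driven prompting step could be sketched roughly as below. The prompt template, function name, and SQLite backend are assumptions; per the summary, generated pairs would additionally be filtered by executing the SQL and checking semantic alignment of the results, which is omitted here:

```python
import sqlite3

# Hypothetical prompt template for schema-aware question-SQL synthesis.
SYNTHESIS_PROMPT = """\
You are generating text-to-SQL pairs for an aviation MRO database.
Given the schema below, write one natural-language question a maintenance
engineer might ask, followed by the SQL query that answers it.

Schema:
{schema}
"""

def build_synthesis_prompt(db_path):
    """Render the database schema (table DDL) into an LLM prompt.

    Reads CREATE TABLE statements from SQLite's sqlite_master catalog so
    the model sees real table and column names, making the generated
    questions grounded in the actual database structure.
    """
    conn = sqlite3.connect(db_path)
    ddl = [row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL")]
    conn.close()
    return SYNTHESIS_PROMPT.format(schema="\n\n".join(ddl))
```

In a full pipeline, the returned prompt would be sent to an LLM, the SQL in its response executed against the database, and only pairs whose execution succeeds and matches the question's intent kept for the benchmark.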
Patrick Sutanto
Institut Sains dan Teknologi Terpadu Surabaya (ISTTS), Soji.ai
Jonathan Kenrick
Institut Sains dan Teknologi Terpadu Surabaya (ISTTS), Soji.ai
Max Lorenz
Soji.ai
Joan Santoso
Institut Sains dan Teknologi Terpadu Surabaya (Sekolah Tinggi Teknik Surabaya)
Web Mining · Natural Language Processing · Data Mining · Artificial Intelligence · Big Data Analytics