🤖 AI Summary
This study addresses the need for on-premises natural-language-to-SQL (NL2SQL) solutions in biopharmaceutical manufacturing environments where GxP requirements restrict the use of cloud services. The authors present PharmaBatchDB AI, a platform that runs open-source large language models, including Qwen 2.5 Coder 7B and Llama 3.1 8B, locally on consumer-grade hardware via Ollama, and systematically evaluates them on 60 domain-specific queries against a synthetically generated pharmaceutical database. The work provides the first empirical validation of local LLMs for regulatory-compliant data querying under GxP constraints and shows that code-finetuned general-purpose models outperform a domain-specialized alternative. Llama 3.1 8B achieves the highest SQL compliance rate, while Qwen 2.5 Coder 7B leads on ROUGE-L, factual consistency, and hallucination control. These findings indicate that local NL2SQL systems are feasible in regulated settings, provided that outputs remain subject to human verification.
📝 Abstract
Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database.
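As a rough illustration of what local natural-language-to-SQL generation through Ollama can look like, the sketch below builds a request for the `/api/generate` endpoint of a locally running Ollama server. The model tag, prompt wording, example schema, and helper names (`build_nl2sql_request`, `generate_sql`) are assumptions made for illustration; they are not the study's actual configuration.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumed; not stated in the study).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_nl2sql_request(question: str, schema: str,
                         model: str = "qwen2.5-coder:7b") -> urllib.request.Request:
    """Build, but do not send, a non-streaming generation request."""
    # Hypothetical prompt template: schema first, then the user's question.
    prompt = (
        "You are a T-SQL assistant. Using only the schema below, "
        "write one SELECT statement that answers the question.\n\n"
        f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    )
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_sql(question: str, schema: str,
                 model: str = "qwen2.5-coder:7b") -> str:
    """Send the request to a running Ollama server and return the raw model text."""
    request = build_nl2sql_request(question, schema, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Because the request only targets `localhost`, no question text or schema leaves the machine, which is the property that makes this deployment style attractive under cloud restrictions. The raw text returned by `generate_sql` would still need SQL extraction and validation before execution.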
A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency.
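Two of the listed metrics can be sketched in a few lines. The example below is a minimal illustration, assuming that SQL extraction rate is the fraction of model responses from which a `SELECT` statement can be recovered and that ROUGE-L is the F1 score over the longest common subsequence of whitespace tokens. The study's exact definitions and tooling are not given here, so the function names and the extraction heuristic are hypothetical.

```python
import re

# Matches a fenced block optionally tagged "sql" (illustrative heuristic).
_FENCE = re.compile(r"`{3}(?:sql)?\s*(.*?)`{3}", re.IGNORECASE | re.DOTALL)


def extract_sql(text):
    """Return the first SELECT statement found in a response, or None."""
    fenced = _FENCE.search(text)
    candidate = fenced.group(1) if fenced else text
    # Simplification: only plain SELECT statements are recognised (no CTEs).
    match = re.search(r"\bSELECT\b.*", candidate, re.IGNORECASE | re.DOTALL)
    return match.group(0).strip() if match else None


def extraction_rate(responses):
    """Fraction of responses from which a SQL statement could be extracted."""
    if not responses:
        return 0.0
    return sum(extract_sql(r) is not None for r in responses) / len(responses)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A response counts toward the extraction rate even if the recovered SQL is wrong, which is why the benchmark pairs it with separate compliance and factual-consistency metrics.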
Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant.
The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned natural-language query (NLQ) systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.