Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records

📅 2025-09-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Arabic medical dialogue systems suffer from a severe scarcity of high-quality annotated data. To address this, we propose a generative-AI–driven synthetic data augmentation framework: leveraging ChatGPT-4o and Gemini 2.5 Pro to generate medical question-answer pairs, followed by semantic filtering and rigorous human validation, yielding a high-fidelity synthetic dataset of 100,000 patient records. We fine-tune Arabic-language LMs—including Mistral-7B and AraGPT2—on this corpus. Experiments demonstrate that ChatGPT-4o–generated data significantly outperforms Gemini 2.5 Pro–generated data, yielding substantial F1-score improvements and markedly reducing hallucination rates. Effectiveness is corroborated jointly by BERTScore metrics and domain-expert evaluation. This work constitutes the first systematic empirical validation of synthetic data’s feasibility and superiority for low-resource Arabic medical NLP, establishing a scalable, high-quality paradigm for developing robust Arabic medical dialogue systems.
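The summary mentions that generated question-answer pairs are semantically filtered before human validation, but the paper's code is not reproduced here. As a minimal, stdlib-only sketch of the *idea* behind such a filter, the function below drops near-duplicate synthetic QA pairs using cosine similarity over bag-of-words vectors; the real pipeline presumably uses embedding-based similarity, and the `threshold` value here is illustrative, not from the paper.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_filter(pairs, threshold=0.9):
    """Keep a synthetic (question, answer) pair only if it is not a
    near-duplicate of an already-accepted pair (similarity < threshold)."""
    kept, vecs = [], []
    for q, a in pairs:
        vec = Counter((q + " " + a).split())
        if all(cosine(vec, v) < threshold for v in vecs):
            kept.append((q, a))
            vecs.append(vec)
    return kept
```

In practice one would swap the bag-of-words vectors for sentence embeddings from a multilingual encoder, keeping the same accept/reject loop.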

📝 Abstract
The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using advanced generative AI systems, ChatGPT-4o and Gemini 2.5 Pro, we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.
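The abstract reports F1-scores via BERTScore, which matches candidate and reference tokens by contextual-embedding similarity and then combines precision and recall. As a self-contained stand-in for that metric, the sketch below computes a plain token-overlap F1; it illustrates the precision/recall/F1 structure of the evaluation, not the paper's actual BERTScore setup (which requires a pretrained encoder).

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a lightweight analogue of BERTScore's F1,
    using exact token matches instead of embedding similarity."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())  # matched / candidate length
    recall = overlap / sum(r.values())     # matched / reference length
    return 2 * precision * recall / (precision + recall)
```

With the actual metric, one would score each model answer against the reference answer using the `bert-score` package with a multilingual or Arabic encoder, then average F1 over the test set.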
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of annotated Arabic medical datasets
Enhancing scalability and generalization of medical chatbots
Evaluating synthetic data augmentation for low-resource NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data augmentation for training expansion
Using ChatGPT-4o and Gemini for synthetic generation
Fine-tuning LLMs with filtered synthetic medical pairs