🤖 AI Summary
This study addresses a critical limitation of existing Arabic medical text generation approaches, which neglect variations in clinical severity during fine-tuning, a shortcoming that can have severe consequences for high-risk patients. To mitigate this issue, the authors propose a severity-aware weighted loss function that dynamically adjusts token-level loss weights based on soft severity probabilities, prioritizing the accurate generation of clinically critical content without modifying the underlying model architecture. This work represents the first effort to incorporate clinical severity information into Arabic medical text generation. Evaluated on the MAQA dataset using an AraBERT-based severity classifier and applied across multiple large language models, including AraGPT2 and Qwen2.5, the proposed method consistently outperforms standard cross-entropy fine-tuning, reaching a peak score of 67.18% and yielding up to a 12.10% absolute improvement over non-fine-tuned baselines.
📝 Abstract
Large language models have shown strong potential for Arabic medical text generation; however, traditional fine-tuning objectives treat all medical cases uniformly, ignoring differences in clinical severity. This limitation is particularly critical in healthcare settings, where errors in severe cases carry higher clinical risk. In this work, we propose a severity-aware weighted loss for fine-tuning Arabic language models on medical complaint-response data. The method uses soft severity probabilities to dynamically scale token-level loss contributions during optimization, thereby prioritizing clinically critical interactions without modifying model architectures. Experiments are conducted on the MAQA dataset, which provides Arabic medical complaints and trusted human responses. Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier and incorporated exclusively at the loss level. The proposed approach is evaluated across ten Arabic large language models of varying architectures and parameter scales. While standard cross-entropy fine-tuning yields only modest improvements, severity-aware optimization consistently achieves larger gains. Using a balanced weighting configuration, performance improves from 54.04% to 66.14% for AraGPT2-Base, from 59.16% to 67.18% for AraGPT2-Medium, and from 57.83% to 66.86% for Qwen2.5-0.5B, with peak performance reaching 67.18%. Overall, severity-aware fine-tuning delivers improvements of up to 12.10% over non-fine-tuned baselines, demonstrating robust and architecture-consistent gains.
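The abstract describes scaling token-level cross-entropy by soft severity probabilities. The exact weighting formula is not given there, so the sketch below assumes one plausible form, a linear weight `1 + alpha * severity_prob`, purely for illustration; the function name, the `alpha` hyperparameter, and the NumPy setting are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def severity_weighted_ce(logits, targets, severity_prob, alpha=1.0):
    """Token-level cross-entropy scaled by a severity-derived weight.

    logits        -- (seq_len, vocab) array of next-token logits
    targets       -- (seq_len,) array of gold token ids
    severity_prob -- classifier's soft probability that the case is severe
    alpha         -- hypothetical scaling hyperparameter (assumed here)
    """
    probs = softmax(logits)
    # Standard per-token negative log-likelihood.
    token_ce = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    # One illustrative weighting scheme: severe cases contribute more loss.
    weight = 1.0 + alpha * severity_prob
    return float(weight * token_ce.mean())
```

With uniform logits over a 5-token vocabulary, a severity probability of 0 yields the plain cross-entropy ln(5) ≈ 1.609, while a severity probability of 1 (with `alpha=1`) doubles it, so gradients from severe cases are amplified during optimization.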