Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges of deploying large language models in clinical settings by systematically evaluating small-scale models—specifically Llama-3, Gemma-3, and Qwen3 variants with approximately 1 billion parameters—across 20 Italian clinical natural language processing (NLP) tasks. It provides the first comprehensive comparison of lightweight adaptation strategies, including few-shot prompting, constrained decoding, supervised fine-tuning, and continual pretraining, within the Italian healthcare NLP domain. The results demonstrate that a fine-tuned Qwen3-1.7B model outperforms the much larger Qwen3-32B by an average of 9.2 points. To support further research, the authors release the most extensive collection of Italian medical NLP resources to date: 126 million words of emergency department text, a 175-million-word pretraining corpus, and their best-performing models.
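Of the inference-time strategies the summary lists, few-shot prompting is the simplest: prepend a handful of labelled demonstrations to the query before sending it to the model. A minimal sketch of prompt construction for an Italian clinical extraction task is below; the example sentences, the instruction text, and the `SYMPTOM:` label format are illustrative assumptions, not taken from the paper.

```python
# Hypothetical few-shot demonstrations (Italian clinical text -> extracted
# symptom); these examples are illustrative, not from the paper's datasets.
FEW_SHOT_EXAMPLES = [
    ("Paziente con febbre alta da tre giorni.", "SYMPTOM: febbre alta"),
    ("Dolore toracico irradiato al braccio sinistro.", "SYMPTOM: dolore toracico"),
]

def build_prompt(query, examples=FEW_SHOT_EXAMPLES):
    """Concatenate k labelled demonstrations before the query sentence."""
    parts = ["Estrai i sintomi dal testo clinico."]  # task instruction
    for text, label in examples:
        parts.append(f"Testo: {text}\nEntità: {label}")
    # The query is formatted like the demonstrations, with the answer slot
    # left open for the model to complete.
    parts.append(f"Testo: {query}\nEntità:")
    return "\n\n".join(parts)
```

The resulting string would be passed as-is to any instruction-following model; varying the number of demonstrations is what "few-shot" sweeps in such evaluations typically control.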

📝 Abstract
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can perform medical tasks effectively while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration, based on Qwen3-1.7B, achieving an average score 9.2 points higher than Qwen3-32B. We release a comprehensive collection of all publicly available Italian medical NLP datasets, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian hospital, and 175M words from various sources that we used for continual pre-training.
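The other inference-time strategy the abstract names, constraint decoding, restricts generation so the model can only emit strings from a known output space (e.g. a fixed label set). A minimal greedy sketch over toy token sequences is below; the scoring function is a hypothetical stand-in for an LM's next-token scores, and the label set is illustrative, not the paper's.

```python
def constrained_decode(score_fn, allowed_sequences):
    """Greedy decoding restricted to a fixed set of token sequences.

    score_fn(prefix, token) -> float is a stand-in for the language model's
    next-token score; allowed_sequences is a list of token tuples (assumed
    prefix-free, i.e. no allowed sequence is a prefix of another).
    """
    prefix = ()
    while True:
        # Tokens that keep the output a prefix of some allowed sequence;
        # all other vocabulary items are effectively masked out.
        candidates = {seq[len(prefix)] for seq in allowed_sequences
                      if seq[:len(prefix)] == prefix and len(seq) > len(prefix)}
        if not candidates:  # prefix is now a complete allowed sequence
            return prefix
        # Among the valid continuations, take the model's highest-scoring one.
        prefix += (max(candidates, key=lambda t: score_fn(prefix, t)),)
```

In practice this masking is applied to the model's logits at each step (e.g. via a trie over the tokenized label set); the loop above shows only the core idea of pruning invalid continuations before the argmax.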
Problem

Research questions and friction points this paper is trying to address.

Small LLMs
Medical NLP
Computational Efficiency
Model Performance
Healthcare Deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

small LLMs
medical NLP
constraint decoding
continual pre-training
few-shot prompting