🤖 AI Summary
This work addresses the difficulty of deploying large language models in resource-constrained clinical settings, where their computational demands are prohibitive. The authors propose RadLite, a framework that applies LoRA fine-tuning to two compact models (Qwen2.5-3B and Qwen3-4B), trains them jointly on nine radiology tasks, and quantizes them to GGUF for efficient inference. The study shows that lightly fine-tuned small models can handle multi-task radiological analysis and run entirely on CPU with modest memory requirements (1.8–2.4 GB) at 4–8 tokens per second. Compared with zero-shot baselines, fine-tuning yields substantial gains: +53% in RADS classification accuracy, +60% in natural language inference (NLI), and +89% in N staging. The two models have complementary strengths, with Qwen2.5 stronger on structured generation and Qwen3 stronger on extractive tasks, and a task-routed ensemble of the two achieves the best performance across all tasks.
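The reported memory footprint can be sanity-checked with back-of-the-envelope arithmetic: file size is roughly the parameter count times the average bits per weight. The sketch below is illustrative only. The summary does not state which GGUF quantization scheme was used, so the 4.8 bits per weight figure is an assumption (in the range typical of 4-bit mixed schemes), the parameter counts are the nominal 3B and 4B, and `quantized_size_gb` is a hypothetical helper.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GB (10^9 bytes) of a quantized weight file.

    Ignores metadata, tokenizer data, and any tensors kept at higher precision.
    """
    return n_params * bits_per_weight / 8 / 1e9


# ASSUMPTION: ~4.8 bits per weight; the source does not name the scheme.
ASSUMED_BITS_PER_WEIGHT = 4.8

for name, n_params in [("3B model", 3e9), ("4B model", 4e9)]:
    size = quantized_size_gb(n_params, ASSUMED_BITS_PER_WEIGHT)
    print(f"{name}: ~{size:.1f} GB")
```

Under these assumptions the estimates land in the same range as the 1.8–2.4 GB reported above; actual file sizes depend on the exact parameter counts and quantization type.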
📝 Abstract
Large language models (LLMs) show promise in radiology, but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks (RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology Q&A) compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths: Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-routed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4 GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models, which we collectively call RadLite, can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements.
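One plausible reading of the task-routed oracle ensemble in finding (3) is a lookup table that sends each task to whichever fine-tuned model scored higher on that task's held-out set. The sketch below assumes that reading; the abstract does not describe the routing mechanism. The function name, model keys, task names, and score values are hypothetical placeholders, not results from the paper.

```python
from typing import Dict


def build_oracle_router(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each task to the model with the highest score on that task.

    scores[model][task] is a metric where higher is better. A model with no
    score for a task is never preferred over one that has a score.
    """
    tasks = set()
    for per_task in scores.values():
        tasks.update(per_task)

    router = {}
    for task in sorted(tasks):
        router[task] = max(
            scores, key=lambda model: scores[model].get(task, float("-inf"))
        )
    return router


# PLACEHOLDER scores for illustration only (not taken from the paper).
placeholder_scores = {
    "qwen2.5-3b": {"impression_generation": 0.70, "ner": 0.60},
    "qwen3-4b": {"impression_generation": 0.65, "ner": 0.75},
}

router = build_oracle_router(placeholder_scores)
print(router)
```

Because the table is chosen from evaluation scores, an "oracle" router of this kind gives an upper bound on what per-task routing can achieve; a deployed system would need to fix the routing on a separate validation split.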