RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of deploying large language models in resource-constrained clinical settings, where their computational demands are prohibitive. The authors propose RadLite, a framework that applies LoRA fine-tuning to two compact models, Qwen2.5-3B-Instruct and Qwen3-4B, using multi-task training across nine radiology tasks and GGUF quantization for efficient inference. The study shows that small fine-tuned models can handle multi-task radiological analysis while running entirely on CPU, at roughly 1.8–2.4 GB in GGUF format and a throughput of 4–8 tokens per second. Compared with zero-shot baselines, fine-tuning yields substantial gains: +53% in RADS classification accuracy, +60% in natural language inference (NLI), and +89% in N staging. The two models have complementary strengths, with Qwen2.5 stronger on structured generation and Qwen3 stronger on extractive tasks, and an ensemble that routes each task to the better model achieves the best performance across all tasks.
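
The reported model sizes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the file size of a weight-only quantized model from a parameter count and an average number of bits per weight. The function name `quantized_size_gb`, the nominal parameter counts (3e9 and 4e9), and the 4.5 bits-per-weight figure are illustrative assumptions; the summary does not state which quantization level was used.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GB (1e9 bytes) of a weight-only quantized model.

    Ignores file metadata and any tensors stored at higher precision.
    """
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1e9


# ASSUMPTIONS: nominal parameter counts and an average of 4.5 bits per weight.
# Neither value is taken from the paper.
ASSUMED_BITS_PER_WEIGHT = 4.5
estimates = {
    "3B-class model": quantized_size_gb(3.0e9, ASSUMED_BITS_PER_WEIGHT),
    "4B-class model": quantized_size_gb(4.0e9, ASSUMED_BITS_PER_WEIGHT),
}

for name, size_gb in estimates.items():
    print(f"{name}: ~{size_gb:.2f} GB")
```

Under these assumptions the estimates fall slightly below the reported 1.8–2.4 GB range, which is plausible because real GGUF files also carry metadata and may keep some tensors at higher precision.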

📝 Abstract

Large language models (LLMs) show promise in radiology, but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks - RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology Q&A - compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths - Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-routed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4 GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models - which we collectively call RadLite - can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements.
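
The ensemble in finding (3) can be read as a per-task lookup that sends each request to whichever model scored higher on that task. The sketch below is a minimal illustration under that reading, not the authors' implementation: the helper names `build_routing_table` and `route`, the stub models, and the scores are all hypothetical placeholders, and the scores are not results from the paper.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


def build_routing_table(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each task to the name of the model with the highest score on it."""
    return {
        task: max(by_model, key=lambda name: by_model[name])
        for task, by_model in scores.items()
    }


def route(task: str, prompt: str, table: Dict[str, str],
          models: Dict[str, Model]) -> str:
    """Send the prompt to the model selected for this task."""
    return models[table[task]](prompt)


# Stub models standing in for the two fine-tuned checkpoints (hypothetical).
models: Dict[str, Model] = {
    "model_a": lambda prompt: f"model_a answer to: {prompt}",
    "model_b": lambda prompt: f"model_b answer to: {prompt}",
}

# PLACEHOLDER per-task scores, invented purely to exercise the code.
placeholder_scores = {
    "impression_generation": {"model_a": 0.6, "model_b": 0.5},
    "ner": {"model_a": 0.4, "model_b": 0.7},
}

table = build_routing_table(placeholder_scores)
print(table)
print(route("ner", "List the findings.", table, models))
```

Because the routing table is built from evaluation scores, this describes an upper bound when those scores come from the test set itself; a deployed router would need to select models on a separate validation split.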

Problem

Research questions and friction points this paper is trying to address.

radiology AI

small language models

CPU deployment

multi-task learning

resource-constrained environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA fine-tuning

small language models

multi-task radiology AI