Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current language models face significant challenges in multi-task brain MRI analysis, primarily due to spatial information loss during tokenization and the limited availability of paired image–text data. To address these issues, this work proposes LLaBIT, the first unified framework to jointly handle four clinically relevant tasks—segmentation, image translation, report generation, and visual question answering—within a single architecture. The model mitigates spatial detail degradation by reusing feature maps from the image encoder, and it compensates for scarce multimodal training data with an instruction-guided text synthesis strategy. Evaluated on five brain MRI datasets, LLaBIT consistently achieves state-of-the-art performance and even surpasses task-specific models in segmentation and image translation, demonstrating strong generalizability and clinical utility.

📝 Abstract
LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text description to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs under strict predefined instructions to augment the limited image–text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.
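The abstract's key architectural idea—reusing encoder feature maps to offset the spatial detail lost when images are compressed into tokens—can be sketched as follows. This is a minimal NumPy illustration of the general skip-connection principle, not the paper's implementation; the shapes, the average-pooling tokenizer, and the additive fusion are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 16   # spatial size of the encoder feature map (assumed)
C = 32       # channels (assumed)
P = 4        # patch size used when flattening features into tokens

feat = rng.standard_normal((C, H, W))  # stand-in for an image-encoder feature map

def to_tokens(f, p):
    """Pool p x p patches into tokens (lossy: fine spatial detail is discarded)."""
    c, h, w = f.shape
    return f.reshape(c, h // p, p, w // p, p).mean(axis=(2, 4)).reshape(c, -1).T

def from_tokens(t, c, h, w, p):
    """Map tokens back to a feature map by nearest-neighbour upsampling."""
    grid = t.T.reshape(c, h // p, w // p)
    return grid.repeat(p, axis=1).repeat(p, axis=2)

tokens = to_tokens(feat, P)                # (num_tokens, C): spatial detail lost
decoded = from_tokens(tokens, C, H, W, P)  # coarse, blocky reconstruction

# Feature-map reuse: fuse the original encoder features back into the decoded
# representation, so downstream pixel-level tasks see the lost detail again.
fused = (decoded + feat) / 2

err_tokens_only = np.abs(feat - decoded).mean()
err_with_reuse = np.abs(feat - fused).mean()
assert err_with_reuse < err_tokens_only  # reuse restores spatial fidelity
```

The additive skip is only one plausible fusion; concatenation followed by a learned projection would illustrate the same point. What matters is that dense pixel-level tasks such as segmentation and translation receive features that never passed through the lossy token bottleneck.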
Problem

Research questions and friction points this paper is trying to address.

brain MRI
multitask learning
medical image analysis
vision-language model
clinical utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual instruction tuning
brain MRI
feature map reuse
image-text data augmentation
multitask medical vision-language model
Jonghun Kim
Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, Korea
Sinyoung Ra
Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Korea
Hyunjin Park
Professor of Electrical-Computer Engineering and Artificial Intelligence, Sungkyunkwan University
Medical Image Computing · Computer Vision for Medicine · Segmentation · Registration