Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses three obstacles to reliable, scalable AI in gastrointestinal endoscopy: limited annotated data, privacy constraints, and the high computational cost of conventional fine-tuning. The authors propose a dual-pipeline parameter-efficient fine-tuning (PEFT) framework. One pipeline fine-tunes the Florence-2 vision-language model for clinical visual question answering (VQA); the other applies LoRA-based fine-tuning to Stable Diffusion 2.1 to generate high-fidelity synthetic endoscopic images that augment training data without exposing patient data. On the Kvasir-VQA dataset, the VQA model reaches a ROUGE-1 score of 0.92 and raises BLEU from 0.08 to 0.24. The rank-4 LoRA image generator attains a Fréchet BiomedCLIP Distance (FBD) of 1450 while reducing computational cost by nearly 90%. Its FBD is better than that of baselines such as FLUX, MSDM, and Kandinsky 2.2, although some of those models score higher on fidelity or agreement.

📝 Abstract

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.
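The abstract reports FBD without defining it. As a reading aid, and assuming FBD follows the standard Fréchet distance between two Gaussians (the form used by FID) with BiomedCLIP image embeddings as the feature space, the metric would take the following shape. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of embeddings of real images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\[
\mathrm{FBD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Under this assumed definition the distance is zero when the two embedding distributions share the same mean and covariance, so a lower value indicates generated images that are statistically closer to the real ones.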

Problem

Research questions and friction points this paper is trying to address.

gastrointestinal endoscopy

annotated data scarcity

privacy constraints

model fine-tuning bottleneck

clinical AI reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-Efficient Fine-Tuning

Visual Question Answering

Low-Rank Adaptation
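
To make the Low-Rank Adaptation idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is not the authors' implementation: the layer size (320 x 320), the scaling factor `alpha`, and the helper names `lora_forward` and `trainable_fraction` are assumptions chosen for illustration. Only the rank of 4 comes from the abstract.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen linear layer plus a low-rank update: y = x W^T + (alpha / rank) * x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


def trainable_fraction(d_out, d_in, rank):
    """Ratio of LoRA parameters (A and B) to the parameters of the full weight matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)


rng = np.random.default_rng(0)
d_in, d_out, rank = 320, 320, 4  # hypothetical layer size; rank 4 as in the abstract

W = rng.standard_normal((d_out, d_in))        # pretrained weight, kept frozen
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialized

x = rng.standard_normal((2, d_in))
y = lora_forward(x, W, A, B, alpha=4.0, rank=rank)

# With B initialized to zero, the adapted layer starts out identical to the frozen one.
print(np.allclose(y, x @ W.T))
print(trainable_fraction(d_out, d_in, rank))
```

For this hypothetical layer, only `A` and `B` would be updated during fine-tuning, which is a small fraction of the full weight matrix. The per-layer ratio is an illustration of why LoRA is cheap and should not be read as the source of the paper's reported cost reduction.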