🤖 AI Summary
To address hallucination in medical vision-language models (VLMs), a failure mode that is potentially hazardous for clinical decision-making, this work introduces the first large-scale instruction-tuning benchmark designed specifically for hallucination mitigation, comprising over 100,000 medical images and one million instruction-response pairs. It establishes a dual-state data construction paradigm that pairs hallucinatory with non-hallucinatory outputs, annotates each for clinical risk, and systematically defines and quantifies hallucination in medical VLMs. Methodologically, it combines clinical knowledge-enhanced data cleaning, adversarial sampling, and supervised fine-tuning (SFT) applied to mainstream open-source medical VLMs (e.g., LLaVA-Med, PMC-VL-Chat). Experiments demonstrate substantial improvements: average zero-shot visual question answering (VQA) accuracy increases by 12.3%, the clinical-risk misclassification rate decreases by 37.6%, and the approach generalizes well across disease categories and imaging modalities.
📝 Abstract
The increasing use of vision-language models (VLMs) in healthcare applications presents significant challenges related to hallucinations, in which models generate seemingly plausible outputs that are in fact incorrect. Such hallucinations can jeopardize clinical decision-making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual question answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Code and dataset will be available at https://github.com/russellyq/MedHallTune.