🤖 AI Summary
To address hallucination in medical vision-language models (VLMs), a failure mode that is potentially hazardous for clinical decision-making, this work introduces the first large-scale instruction-tuning benchmark designed specifically for hallucination mitigation, comprising over 100,000 medical images and one million instruction-response pairs. It establishes a dual-state data construction paradigm that pairs hallucinatory with non-hallucinatory outputs, annotates each for clinical risk, and systematically defines and quantifies hallucination in medical VLMs. Methodologically, it combines clinical knowledge-enhanced data cleaning, adversarial sampling, and supervised fine-tuning (SFT) applied to mainstream open-source medical VLMs (e.g., LLaVA-Med, PMC-VL-Chat). Experiments demonstrate substantial improvements: average zero-shot visual question answering (VQA) accuracy increases by 12.3%, the clinical-risk misclassification rate decreases by 37.6%, and the approach generalizes well across disease categories and imaging modalities.
📝 Abstract
The increasing use of vision-language models (VLMs) in healthcare applications presents significant challenges related to hallucinations, in which models generate seemingly plausible outputs that are in fact incorrect. Such hallucinations can jeopardize clinical decision-making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual question answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Code and dataset will be available at https://github.com/russellyq/MedHallTune.