🤖 AI Summary
In pathology vision-language modeling, clinical pathology reports often contain non-morphology-based information (e.g., patient history) that cannot be inferred from H&E whole-slide images (WSIs), leading to hallucinated report generation. Method: We propose a fidelity-oriented text preprocessing strategy that retains only descriptive statements directly supported by histomorphological evidence. Built upon the BLIP-2 framework, we systematically evaluate this approach on 42,433 WSIs and 19,636 clinical reports. Contribution/Results: This is the first study to empirically characterize the trade-off between text abridgment and multimodal representation objectives: abridged reports substantially reduce hallucination rates (validated by an expert pathologist) and improve report fidelity, whereas full reports better support bidirectional image-text retrieval. Our work establishes a reproducible methodological paradigm and empirical foundation for designing trustworthy textual supervision in medical multimodal modeling.
📝 Abstract
Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from the paired whole-slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. Motivated by this, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances in the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole-slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improved quality of the generated reports, however, the model trained on full reports achieved better cross-modal retrieval performance.
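The preprocessing idea described above (keeping only sentences grounded in H&E morphology) can be sketched as a simple rule-based sentence filter. The keyword list and function below are illustrative assumptions for a minimal sketch, not the paper's actual selection procedure:

```python
import re

# Hypothetical morphology vocabulary; the actual sentence-selection criteria
# used in the study are not specified here, so this list is an assumption.
MORPHOLOGY_TERMS = {
    "melanocytes", "nests", "epidermis", "dermis", "atypia",
    "mitoses", "pagetoid", "junction", "pigment", "nuclei",
}

def filter_report(report: str) -> str:
    """Keep only sentences mentioning at least one morphology term."""
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    kept = [
        s for s in sentences
        if any(term in s.lower() for term in MORPHOLOGY_TERMS)
    ]
    return " ".join(kept)

report = (
    "The patient has a history of melanoma. "
    "Nests of atypical melanocytes are present at the dermo-epidermal junction. "
    "Clinical follow-up is recommended."
)
print(filter_report(report))
# Only the morphology-describing sentence is retained; the history and
# recommendation sentences are dropped.
```

A production pipeline would more likely use a trained sentence classifier than keyword matching, but the toy filter conveys the trade-off the paper studies: the abridged text is more faithful to what the slide shows, at the cost of discarding context that aids cross-modal retrieval.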