On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In pathology vision-language modeling, clinical reports often contain non-morphological information (e.g., patient history) that cannot be inferred from H&E whole-slide images (WSIs), leading to hallucinated sentences in generated reports. Method: We propose a fidelity-oriented text preprocessing strategy that retains only descriptive statements directly supported by histomorphological evidence. Building on the BLIP-2 framework, we systematically evaluate this approach on 42,433 WSIs and 19,636 clinical reports. Contribution/Results: This is the first study to empirically characterize the trade-off between text abridgment and multimodal representation quality: abridged reports significantly reduce hallucination rates, as validated by an expert pathologist, and improve report fidelity, whereas full reports better support bidirectional image–text retrieval. Our work establishes a reproducible methodological paradigm and an empirical foundation for designing trustworthy textual supervision in medical multimodal modeling.

📝 Abstract
Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improvement in the quality of the generated reports, training the vision-language model on full reports showed better cross-modal retrieval performance.
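The preprocessing step the abstract describes, keeping only sentences that describe cell and tissue appearance on the H&E-stained slides, can be sketched as a sentence-level filter. The sketch below is purely illustrative: the paper does not specify its selection rules, and the `MORPHOLOGY_TERMS` and `NON_MORPHOLOGY_TERMS` vocabularies here are hypothetical keyword heuristics, not the authors' method.

```python
import re

# Hypothetical vocabulary of H&E morphology terms (assumption for illustration).
MORPHOLOGY_TERMS = {
    "nest", "nests", "melanocytes", "nuclei", "mitoses", "epidermis",
    "dermis", "pigment", "atypia", "junctional", "spindle",
}

# Hypothetical markers of non-morphological content such as patient history
# (assumption for illustration).
NON_MORPHOLOGY_TERMS = {"history", "clinical", "previous", "referred"}

def keep_sentence(sentence: str) -> bool:
    """Keep a sentence only if it mentions morphology and no clinical context."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & MORPHOLOGY_TERMS) and not (words & NON_MORPHOLOGY_TERMS)

def preprocess_report(report: str) -> str:
    """Retain only sentences plausibly grounded in the slide appearance."""
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    return " ".join(s for s in sentences if keep_sentence(s))
```

In practice such filtering would need a far richer approach (e.g., a trained sentence classifier), but the sketch conveys the idea of dropping non-inferable content before vision-language training.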
Problem

Research questions and friction points this paper is trying to address.

Text preprocessing impact on multimodal representations
Preventing hallucination in pathology report generation
Comparing full vs. preprocessed report training effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text preprocessing prevents hallucination
BLIP-2 framework enhances multimodal learning
Selective report content improves report fidelity
Ruben T. Lucassen
Dept. of Pathology, University Medical Center Utrecht, the Netherlands; Dept. of Biomedical Engineering, Eindhoven University of Technology, the Netherlands
Tijn van de Luijtgaarden
Dept. of Mathematics and Computer Science, Eindhoven University of Technology, the Netherlands
Sander P.J. Moonemans
Dept. of Mathematics and Computer Science, Eindhoven University of Technology, the Netherlands
Gerben E. Breimer
Dept. of Pathology, University Medical Center Utrecht, the Netherlands
Willeke A.M. Blokx
Dept. of Pathology, University Medical Center Utrecht, the Netherlands
Mitko Veta
Associate Professor, Eindhoven University of Technology
Medical Image Analysis · Digital Pathology · Machine Learning