🤖 AI Summary
In pathology vision-language modeling, clinical pathology reports often contain non-morphology-based information (e.g., patient history) that cannot be inferred from H&E whole-slide images (WSIs), leading to hallucinated report generation. Method: We propose a fidelity-oriented text preprocessing strategy that retains only descriptive statements directly supported by histomorphological evidence. Built upon the BLIP-2 framework, we systematically evaluate this approach on 42,433 WSIs and 19,636 clinical reports. Contribution/Results: This is the first study to empirically characterize the trade-off between text abridgment and multimodal representation objectives: abridged reports substantially reduce hallucination rates (validated by an expert pathologist) and improve report fidelity, whereas full reports better support bidirectional image-text retrieval. Our work establishes a reproducible methodological paradigm and empirical foundation for designing trustworthy textual supervision in medical multimodal modeling.
📝 Abstract
Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from the paired whole-slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. Motivated by this, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances in the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole-slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improved quality of the generated reports, however, the model trained on full reports achieved better cross-modal retrieval performance.
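The preprocessing idea described above (keeping only sentences grounded in H&E morphology) can be sketched as a simple rule-based sentence filter. The keyword list and function below are illustrative assumptions for a minimal sketch, not the paper's actual selection procedure:

```python
import re

# Hypothetical morphology vocabulary; the actual sentence-selection criteria
# used in the study are not specified here, so this list is an assumption.
MORPHOLOGY_TERMS = {
    "melanocytes", "nests", "epidermis", "dermis", "atypia",
    "mitoses", "pagetoid", "junction", "pigment", "nuclei",
}

def filter_report(report: str) -> str:
    """Keep only sentences mentioning at least one morphology term."""
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    kept = [
        s for s in sentences
        if any(term in s.lower() for term in MORPHOLOGY_TERMS)
    ]
    return " ".join(kept)

report = (
    "The patient has a history of melanoma. "
    "Nests of atypical melanocytes are present at the dermo-epidermal junction. "
    "Clinical follow-up is recommended."
)
print(filter_report(report))
# Only the morphology-describing sentence is retained; the history and
# recommendation sentences are dropped.
```

A production pipeline would more likely use a trained sentence classifier than keyword matching, but the toy filter conveys the trade-off the paper studies: the abridged text is more faithful to what the slide shows, at the cost of discarding context that aids cross-modal retrieval.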