Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of generating case-level pathology reports from whole-slide images (WSIs), namely gigapixel resolution, extremely long visual-token sequences, and heterogeneity across the multiple WSIs in a case, by proposing a minimalist yet effective vision-language architecture. The framework comprises a frozen pathology patch encoder, a lightweight two-layer MLP aligner, and a large language model decoder, with an explicit WSI marker token that separates slides belonging to the same case. Visual sequences are shortened by using 512×512 patches at 5× magnification, and the model is trained in two supervised stages: aligner-only WSI captioning followed by case-level fine-tuning. Training fits in half of an NVIDIA H100 GPU. The method achieves high ROUGE-L, METEOR, and BLEU-4 scores at substantially lower memory and runtime cost, is consistently preferred over strong baselines in AI-based evaluations, and provides a strong, reproducible baseline for multi-WSI pathology report generation.
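The three-component layout described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature widths, the GELU activation, the random weights, and the choice to place the marker embedding before each slide are all assumptions made for illustration, and the names `MLPAligner` and `build_case_sequence` are hypothetical. The frozen patch encoder is stood in for by random feature matrices.

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    # Tanh approximation of GELU; the activation is an assumption,
    # the source only says "two-layer MLP".
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPAligner:
    """Hypothetical two-layer MLP mapping patch features to the decoder width."""

    def __init__(self, d_vision, d_hidden, d_model):
        self.w1 = rng.normal(0.0, 0.02, (d_vision, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, 0.02, (d_hidden, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, feats):
        # feats: (num_patches, d_vision) -> (num_patches, d_model)
        return gelu(feats @ self.w1 + self.b1) @ self.w2 + self.b2


def build_case_sequence(slide_feats, aligner, wsi_marker):
    """Concatenate aligned patch tokens of all slides in a case.

    One marker embedding is placed before each slide (assumed placement)
    so the decoder can tell where a new WSI starts.
    """
    parts = []
    for feats in slide_feats:
        parts.append(wsi_marker[None, :])  # (1, d_model) delimiter row
        parts.append(aligner(feats))       # (num_patches, d_model)
    return np.concatenate(parts, axis=0)


# Toy sizes, chosen only to keep the example small.
D_VISION, D_HIDDEN, D_MODEL = 8, 16, 12
aligner = MLPAligner(D_VISION, D_HIDDEN, D_MODEL)
wsi_marker = rng.normal(0.0, 0.02, D_MODEL)

# Stand-in for frozen-encoder output: a case with two slides of 5 and 3 patches.
case = [rng.normal(size=(5, D_VISION)), rng.normal(size=(3, D_VISION))]
seq = build_case_sequence(case, aligner, wsi_marker)
print(seq.shape)  # (10, 12): 2 markers + 8 patch tokens
```

In this sketch the resulting matrix would be passed to the language model decoder as its visual prefix; only the aligner (and, in the second stage, whatever the authors choose to fine-tune) would receive gradients, since the encoder is frozen.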

📝 Abstract

Generating clinically useful pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, this enables practical training with only half of an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
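The sequence-length reduction follows from simple geometry: the tissue area covered by one patch grows with the square of patch size divided by magnification, so the number of patches per slide shrinks by the same factor. The sketch below works through that arithmetic. The abstract does not state the patch size of the $20\times$ reference setting, so the two reference sizes used here (512 and 256 pixels) are assumptions, and `reduction_factor` is a hypothetical helper name.

```python
def reduction_factor(patch_px, magnification, ref_patch_px, ref_magnification):
    """Ratio of patch counts (reference / proposed) for the same tissue area.

    A patch of `patch_px` pixels at a given magnification covers a tissue
    side length proportional to patch_px / magnification, so the patch
    count scales with the inverse square of that quantity.
    """
    linear = (patch_px * ref_magnification) / (ref_patch_px * magnification)
    return linear ** 2


# Same 512-pixel patch size, only the magnification changes (assumed reference).
same_size = reduction_factor(512, 5, ref_patch_px=512, ref_magnification=20)

# Assumed reference of 256-pixel patches at 20x.
smaller_ref = reduction_factor(512, 5, ref_patch_px=256, ref_magnification=20)

print(same_size)    # 16.0
print(smaller_ref)  # 64.0
```

Under these assumptions, moving from $20\times$ to $5\times$ alone accounts for a $16\times$ reduction, and a $64\times$ figure is reached if the reference uses 256-pixel patches. Since the abstract reports "up to" $64\times$ on average sequence length, the actual figure presumably also depends on tissue filtering and slide content, which this sketch does not model.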

Problem

Research questions and friction points this paper is trying to address.

vision-language model

pathology report generation

whole-slide image

case-level reasoning

token efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

token-efficient

vision-language model

whole-slide image