GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in whole-slide image (WSI) classification and automated pathological caption generation, including tile redundancy, loss of spatial context, and difficulty in semantic modeling, this paper proposes GNN-ViTCap, a framework that integrates Vision Transformers (ViTs) with Graph Neural Networks (GNNs). It introduces deep-embedding-based dynamic clustering and scalar dot attention to select representative tiles, and jointly fine-tunes a large language model (LLM) for end-to-end multimodal classification and caption generation. By unifying structural and semantic representation learning, GNN-ViTCap achieves state-of-the-art results on the BreakHis and PatchGastric datasets: a 0.934 F1 score and 0.963 AUC for classification, and a 0.811 BLEU-4 and 0.569 METEOR for caption quality, substantially outperforming existing methods.
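The patch-selection stage pairs clustering in embedding space with attention-based scoring. Below is a minimal PyTorch sketch of that idea, with KMeans standing in for the paper's deep embedded clustering and scaled dot-product scoring standing in for its scalar dot attention; the function name, cluster count, and per-cluster budget are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def select_representative_patches(embeddings: torch.Tensor,
                                  n_clusters: int = 8,
                                  per_cluster: int = 4) -> torch.Tensor:
    """Cluster patch embeddings and keep the most attended patches per cluster.

    embeddings: (N, D) tensor of patch features from the visual extractor.
    Returns a tensor of up to n_clusters * per_cluster representative patches.
    """
    # Cluster in embedding space (KMeans stands in for deep embedded clustering).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        embeddings.detach().cpu().numpy())
    labels = torch.as_tensor(labels)

    selected = []
    for c in range(n_clusters):
        members = embeddings[labels == c]             # (Nc, D)
        centroid = members.mean(dim=0, keepdim=True)  # (1, D)
        # Scaled dot-product attention of the cluster centroid over its members.
        scores = (members @ centroid.T).squeeze(-1) / members.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=0)
        top = weights.topk(min(per_cluster, members.shape[0])).indices
        selected.append(members[top])
    return torch.cat(selected, dim=0)
```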

📝 Abstract
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions arising from pathologists' subjective capture process. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
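As the abstract describes, the selected patch embeddings are linked to their nearest neighbors in a similarity matrix, passed through a GNN, and projected into the language model's input space. Here is a minimal sketch of those two steps, assuming PyTorch and a simple mean-aggregation layer in place of whatever GNN variant the paper actually uses; all class names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def knn_graph(x: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Build a symmetric kNN adjacency matrix from pairwise cosine similarity."""
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.T                                  # (N, N) similarity matrix
    idx = sim.topk(k + 1, dim=-1).indices[:, 1:]     # drop self-similarity
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    return ((adj + adj.T) > 0).float()               # symmetrize

class GraphAggregator(nn.Module):
    """One mean-aggregation GNN layer, then a linear projection into the
    language model's input space (dimensions are illustrative)."""
    def __init__(self, dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.gnn = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, llm_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        h = F.relu(self.gnn(adj @ x / deg))   # neighborhood mean + transform
        pooled = h.mean(dim=0)                # slide-level aggregated embedding
        return self.proj(pooled)              # token-like vector for the LLM
```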
Problem

Research questions and friction points this paper is trying to address.

Addresses redundant patches in Whole Slide Image analysis
Solves unknown patch position issues in pathology images
Improves automatic caption generation for histopathology images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic clustering removes redundant image patches
Graph neural network captures local and global context
Fine-tunes a large language model on aggregated image embeddings (see the sketch below)
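For the last item, one common way to fine-tune a causal LLM on a projected image embedding is to prepend it to the caption's token embeddings and train with the standard language-modeling loss. The sketch below assumes the Hugging Face transformers library and uses GPT-2 purely as a stand-in; the paper's actual LLM and conditioning scheme may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a hypothetical stand-in; the paper's exact LLM is not specified here.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def caption_loss(image_embedding: torch.Tensor, caption: str) -> torch.Tensor:
    """Prefix the projected slide embedding (sized to the LLM's hidden dim,
    768 for GPT-2) to the caption tokens and compute the causal-LM loss."""
    tokens = tokenizer(caption, return_tensors="pt").input_ids    # (1, T)
    token_embeds = model.get_input_embeddings()(tokens)           # (1, T, D)
    prefix = image_embedding.view(1, 1, -1)                       # (1, 1, D)
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)      # (1, T+1, D)
    # Mask the image-prefix position out of the loss with the -100 label.
    labels = torch.cat([torch.full((1, 1), -100), tokens], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=labels).loss
```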
S M Taslim Uddin Raju
MASc in Computer Science (Specialized in AI)
Machine Learning · Medical Imaging · Deep Learning · Biomedical Engineering
Md. Milon Islam
University of Waterloo
Multimodal Machine Learning · AI for Health · Large Language Models
Md Rezwanul Haque
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada
Hamdi Altaheri
PhD, Postdoctoral Scholar at University of Waterloo
Deep Learning · Foundation Models · Self-Supervised Learning
Fakhri Karray
Machine Learning Department at Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates and Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada