SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Current multimodal large language models for pathology are constrained by patch-level analysis paradigms, limiting their ability to capture whole-slide context; progress is further hindered by the gigapixel scale of whole-slide images (WSIs) and the lack of large-scale instruction data for end-to-end WSI-level vision-language reasoning. This paper introduces SlideChat, the first vision-language assistant capable of understanding gigapixel WSIs. The authors construct SlideInstruction, the largest instruction-following dataset for WSIs (4.2K WSI captions, 176K VQA pairs), and SlideBench, a multimodal benchmark spanning captioning and VQA tasks across diverse clinical scenarios. Across 22 SlideBench tasks, SlideChat achieves state-of-the-art performance on 18, reaching 81.17% overall accuracy on SlideBench-VQA (TCGA) and 54.15% on SlideBench-VQA (BCNB). All code, data, and models are publicly released.

📝 Abstract
Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA) and 54.15% on SlideBench-VQA (BCNB). Our code, data, and model are publicly accessible at https://uni-medical.github.io/SlideChat.github.io.
Problem

Research questions and friction points this paper is trying to address.

Patch-level MLLMs miss essential whole-slide context in computational pathology.
Gigapixel scale of WSIs makes end-to-end slide-level vision-language modeling difficult.
No large-scale instruction datasets or benchmarks exist for WSI-level captioning and VQA.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed SlideChat for gigapixel whole-slide image understanding.
Created SlideInstruction dataset with 4.2K captions and 176K VQA pairs.
Proposed SlideBench benchmark for multimodal pathology tasks.
👥 Authors
Ying Chen — Shanghai AI Laboratory, Xiamen University
Guoan Wang — Stevens Institute of Technology (General Medical AI)
Yuanfeng Ji — Stanford; HKU (Computer vision; Medical Image Analysis)
Yanjun Li — Shanghai AI Laboratory, East China Normal University
Jin Ye — Shanghai AI Laboratory, Monash University
Tian-Xin Li — Shanghai AI Laboratory
Bin Zhang — The First Affiliated Hospital of Jinan University
Nana Pei — The First Affiliated Hospital of Jinan University
Rongshan Yu — Xiamen University (Statistical signal processing; data compression; bioinformatics)
Yu Qiao — Shanghai AI Laboratory
Junjun He — Shanghai Jiao Tong University