Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Colonoscopy image analysis faces pervasive background clutter, complex medical terminology, and ambiguous multi-lesion descriptions, all of which hinder the clinical applicability of vision-language joint modeling. To address these challenges, the authors propose Endo-CLIP, a three-stage progressive self-supervised pre-training framework built on CLIP. The cleansing stage filters out non-informative background frames. The attunement stage uses large language models to extract clinical attributes from reports, enabling fine-grained vision-language contrastive learning. The unification stage applies patient-level cross-attention to disambiguate records describing multiple polyps. On zero-shot and few-shot polyp detection and classification, Endo-CLIP significantly outperforms state-of-the-art pre-training methods, improving diagnostic robustness and clinical relevance.
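The fine-grained vision-language contrastive learning mentioned above follows the CLIP recipe: a symmetric InfoNCE objective over matched image-text pairs. A minimal, dependency-free sketch of that objective (illustrative only, not the authors' implementation; function names are our own) could look like:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text embedding pairs,
    as in CLIP-style pre-training. Pair i of img_embs and txt_embs is
    the positive; all other pairings in the batch are negatives."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    imgs = [norm(v) for v in img_embs]
    txts = [norm(v) for v in txt_embs]
    n = len(imgs)
    # cosine-similarity matrix scaled by temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # image->text cross-entropy: the diagonal entries are the positives
    loss_i2t = -sum(math.log(softmax(sim[i])[i]) for i in range(n)) / n
    # text->image cross-entropy over the transposed similarity matrix
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(math.log(softmax(cols[j])[j]) for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

With correctly matched pairs the loss is near zero; swapping the texts between the two pairs drives it up sharply, which is the gradient signal that pulls matched image and report embeddings together.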

📝 Abstract
Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP's three-stage framework (cleansing, attunement, and unification) addresses these challenges by (1) removing background frames, (2) leveraging large language models to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis.
Problem

Research questions and friction points this paper is trying to address.

Addresses non-informative colonoscopy background images
Handles complex medical terminology in image-text records
Resolves ambiguous multi-lesion descriptions in colonoscopy data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training on colonoscopy records
Three-stage cleansing, attunement, unification framework
LLM-enhanced fine-grained contrastive learning
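The patient-level cross-attention that resolves multi-polyp ambiguity can be illustrated with a minimal single-head scaled dot-product attention sketch. This is a pure-Python illustration under our own naming, not the paper's implementation: queries stand in for per-polyp image embeddings, keys/values for per-lesion attribute embeddings parsed from the report.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.
    queries: per-polyp image embeddings (one row per detected polyp);
    keys/values: per-lesion text attribute embeddings from the report.
    Returns the attended outputs and the attention-weight rows."""
    d = len(keys[0])
    out, weights = [], []
    for q in queries:
        # scaled dot-product scores against every lesion description
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]          # softmax over lesion entries
        weights.append(w)
        # weighted sum of value vectors = text context for this polyp
        out.append([sum(w[j] * values[j][i] for j in range(len(values)))
                    for i in range(d)])
    return out, weights
```

Each polyp embedding thus attends most strongly to the report fragment that describes it, which is the mechanism that lets one record mentioning several lesions be aligned lesion-by-lesion rather than as a single ambiguous caption.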
Yili He
Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai, China; University College London, London, UK; Shanghai Key Laboratory of MICCAI, Shanghai, China
Yan Zhu
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Endoscopy, Shanghai, China
Peiyao Fu
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Endoscopy, Shanghai, China
Ruijie Yang
Shanghai Institute for Advanced Study of Zhejiang University, Shanghai, China
Tianyi Chen
Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai, China; Shanghai Key Laboratory of MICCAI, Shanghai, China
Zhihua Wang
City University of Hong Kong
Computer Vision; Biomedical Engineering; Robotics
Quanlin Li
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Endoscopy, Shanghai, China
Pinghong Zhou
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Endoscopy, Shanghai, China
Xian Yang
University of Manchester
Artificial Intelligence; Machine Learning; Healthcare AI; Natural Language Processing
Shuo Wang
Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai, China; Shanghai Key Laboratory of MICCAI, Shanghai, China; Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, China; Data Science Institute, Imperial College London, London, UK