SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surgical vision-language pre-training (VLP) datasets suffer from limited scale, narrow procedural coverage, coarse-grained semantics, and an absence of hierarchical structure. To address these bottlenecks, we introduce SurgLaVi, the first large-scale, hierarchically structured surgical vision-language dataset, encompassing 200+ procedures and 238K video clips annotated with a three-level semantic hierarchy (phase, step, task). We design an automated pipeline for video transcription and segmentation and release the open-source subset SurgLaVi-β, which is over four times larger than the largest prior surgical VLP dataset. Furthermore, we propose dual-modality filtering and context-enhanced caption generation, and develop SurgCLIP, a CLIP-based dual-encoder contrastive learning framework. Extensive experiments show that SurgCLIP achieves significant improvements over state-of-the-art methods on surgical phase, step, action, and instrument recognition, validating SurgLaVi's effectiveness in enhancing representation learning and cross-task transferability.
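
Conceptually, each SurgLaVi sample pairs a short video clip with a caption attached to one of the three hierarchy levels. Below is a minimal sketch of how such a record might be represented; the class and field names are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass
from enum import Enum


class HierarchyLevel(Enum):
    """Three-level semantic hierarchy described in the paper."""
    PHASE = "phase"  # coarse procedural stage
    STEP = "step"    # intermediate unit within a phase
    TASK = "task"    # fine-grained action unit within a step


@dataclass
class ClipCaptionPair:
    """Illustrative schema for one clip-caption pair (field names are assumptions)."""
    video_id: str          # source surgical video
    start_sec: float       # clip start time within the source video
    end_sec: float         # clip end time within the source video
    level: HierarchyLevel  # hierarchy level the caption describes
    caption: str           # context-enriched caption produced by the pipeline
```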

📝 Abstract
Vision-language pre-training (VLP) offers unique advantages for surgery by aligning language with surgical videos, enabling workflow understanding and transfer across tasks without relying on expert-labeled datasets. However, progress in surgical VLP remains constrained by the limited scale, procedural diversity, semantic quality, and hierarchical structure of existing datasets. In this work, we present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures, organized hierarchically at the phase, step, and task levels. At the core of SurgLaVi lies a fully automated pipeline that systematically generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure high-quality annotations, it applies dual-modality filtering to remove irrelevant and noisy samples. Within this framework, the resulting captions are enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, we release SurgLaVi-β, an open-source derivative of 113k clip-caption pairs constructed entirely from public data, which is over four times larger than existing surgical VLP datasets. To demonstrate the value of the SurgLaVi datasets, we introduce SurgCLIP, a CLIP-style video-text contrastive framework with dual encoders, as a representative base model. SurgCLIP achieves consistent improvements across phase, step, action, and tool recognition, surpassing prior state-of-the-art methods, often by large margins. These results validate that large-scale, semantically rich, and hierarchically structured datasets translate directly into stronger and more generalizable representations, establishing SurgLaVi as a key resource for developing surgical foundation models.
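
SurgCLIP is described here only at a high level (a CLIP-style video-text contrastive framework with dual encoders). The sketch below shows the symmetric InfoNCE objective such a framework typically optimizes over a batch of matched clip-caption embeddings; the function name, temperature value, and embedding shapes are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched clip-caption pairs.

    video_emb, text_emb: (batch, dim) outputs of the video and text encoders.
    Pairs sharing a batch index are positives; all other pairings act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by the temperature.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (video-to-text and text-to-video).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```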
Problem

Research questions and friction points this paper is trying to address.

Addressing limited scale and diversity in surgical vision-language datasets
Automating generation of hierarchical surgical video annotations
Enhancing surgical AI model performance through improved dataset quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for surgical video transcription
Dual-modality filtering for high-quality annotations (see the sketch after this list)
CLIP-style contrastive framework for surgical tasks
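
The summary does not specify how the dual-modality filtering is implemented. The sketch below illustrates the general idea under the assumption that each clip-caption pair is scored independently on its visual and textual content and kept only if both scores pass a threshold; the scoring callables and threshold values are placeholders, not the paper's actual criteria.

```python
from typing import Callable, Iterable, List, TypeVar

Sample = TypeVar("Sample")  # e.g. the ClipCaptionPair record sketched earlier


def dual_modality_filter(pairs: Iterable[Sample],
                         visual_score: Callable[[Sample], float],
                         text_score: Callable[[Sample], float],
                         visual_threshold: float = 0.5,
                         text_threshold: float = 0.5) -> List[Sample]:
    """Keep only clip-caption pairs that pass independent checks on both modalities.

    visual_score and text_score are placeholder callables (e.g. a surgical-scene
    classifier applied to frames and a caption-quality scorer applied to text);
    the thresholds are illustrative, not values reported by the paper.
    """
    return [pair for pair in pairs
            if visual_score(pair) >= visual_threshold
            and text_score(pair) >= text_threshold]
```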