🤖 AI Summary
Current foundation models (FMs) for surgical video analysis are hindered by the lack of large-scale, diverse, and standardized pretraining and evaluation resources. To address this, we introduce SurgBench—the first unified surgical video analysis benchmark—comprising (1) SurgBench-P, a pretraining dataset of 53 million frames spanning 22 surgical procedures, and (2) SurgBench-E, an evaluation benchmark with 72 fine-grained tasks across six analytical dimensions. Our key contribution is the first standardized framework enabling cross-procedure and cross-modal generalization assessment, integrating multi-source real-world surgical videos, granular task taxonomy, and a unified evaluation protocol. Pretraining on SurgBench-P significantly improves the performance of mainstream video FMs on surgical tasks, particularly enhancing zero-shot and few-shot transfer capabilities to unseen procedures and modalities. SurgBench thus fills a critical gap in large-scale, standardized benchmarking for surgical video understanding.
📝 Abstract
Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce SurgBench, a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.