🤖 AI Summary
Surgical AI faces two coupled challenges: a single procedure involves multiple tasks (e.g., phase recognition and Critical View of Safety assessment in laparoscopic cholecystectomy), and merging their datasets yields incomplete cross-task labels. Conventional single-task models lack flexibility, requiring a separate model per task, while existing multi-task approaches struggle with such partial annotations. We propose MML-SurgAdapt, a unified multi-task framework built on vision-language models (specifically CLIP) that handles diverse surgical tasks through natural language supervision with custom prompts. It adapts single-positive multi-label (SPML) learning, which trains with only one positive label per instance, to fuse data across tasks and remain robust to incomplete or noisy annotations, reducing the required labels by 23%. Evaluated on a combined dataset of Cholec80, Endoscapes2023, and CholecT50, it performs comparably to task-specific benchmarks and outperforms existing SPML frameworks. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, offering a scalable, annotation-efficient approach to multi-task learning in surgical computer vision.
📝 Abstract
Surgical AI often involves multiple tasks within a single procedure, such as phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility and require a separate model for each task. To address this, we introduce MML-SurgAdapt, a unified multi-task framework built on Vision-Language Models (VLMs), specifically CLIP, that handles diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces the annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, using custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations, and that it outperforms existing SPML frameworks on this task. By reducing the required labels by 23%, our approach enables a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
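To make the SPML setting concrete, the sketch below shows one common way such an objective can be set up: a binary cross-entropy over all task labels where the single known-positive label is supervised, labels from the sample's own task are treated as "assumed negative", and labels belonging to other tasks (unobserved when datasets are merged) are masked out. This is a generic illustration, not the paper's exact loss; the function name, the `observed_mask` convention, and the assume-negative choice are assumptions for illustration.

```python
import numpy as np

def spml_assume_negative_loss(logits, positive_idx, observed_mask=None):
    """Illustrative single-positive multi-label BCE.

    logits:        (N, C) raw scores over the union of all task labels.
    positive_idx:  (N,) index of the single known-positive label per sample.
    observed_mask: optional (N, C) array; 1 where a label is observed for the
                   sample's source task, 0 where it belongs to another task
                   and its loss term should be ignored.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-label sigmoid

    # One-hot target: the single annotated positive is 1, everything else
    # is assumed negative (the standard SPML baseline assumption).
    targets = np.zeros_like(logits)
    targets[np.arange(len(logits)), positive_idx] = 1.0

    eps = 1e-12  # numerical guard for log
    bce = -(targets * np.log(probs + eps)
            + (1.0 - targets) * np.log(1.0 - probs + eps))

    if observed_mask is not None:
        # Drop loss terms for labels that were never observable for this
        # sample (e.g., another task's classes after dataset fusion).
        bce = bce * observed_mask
        return bce.sum() / np.maximum(observed_mask.sum(), 1.0)
    return bce.mean()
```

In a CLIP-style setup, `logits` would be the similarities between an image embedding and the text embeddings of per-label prompts; the mask is what lets a frame annotated only for phase recognition still contribute gradient without penalizing predictions for unannotated safety labels.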