🤖 AI Summary
Surgical AI faces two coupled challenges: a single procedure involves multiple tasks (e.g., phase recognition and Critical View of Safety assessment in laparoscopic cholecystectomy), and merging their datasets yields incomplete cross-task labels. Conventional single-task models lack flexibility, requiring a separate model per task, while existing multi-task approaches struggle with such partial annotations. We propose MML-SurgAdapt, a unified multi-task framework built on vision-language models (specifically CLIP) that handles diverse surgical tasks through natural language supervision with custom prompts. It adapts single-positive multi-label (SPML) learning, which trains with only one positive label per instance, to fuse data across tasks and remain robust to incomplete or noisy annotations, reducing the required labels by 23%. Evaluated on a combined dataset of Cholec80, Endoscapes2023, and CholecT50, it performs comparably to task-specific benchmarks and outperforms existing SPML frameworks. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, offering a scalable, annotation-efficient approach to multi-task learning in surgical computer vision.
📝 Abstract
Surgical AI often involves multiple tasks within a single procedure, such as phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility and require a separate model for each task. To address this, we introduce MML-SurgAdapt, a unified multi-task framework built on Vision-Language Models (VLMs), specifically CLIP, that handles diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces the annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, using custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations, and that it outperforms existing SPML frameworks on this task. By reducing the required labels by 23%, our approach enables a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
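To make the SPML setting concrete, the sketch below shows one common way such an objective can be set up: a binary cross-entropy over all task labels where the single known-positive label is supervised, labels from the sample's own task are treated as "assumed negative", and labels belonging to other tasks (unobserved when datasets are merged) are masked out. This is a generic illustration, not the paper's exact loss; the function name, the `observed_mask` convention, and the assume-negative choice are assumptions for illustration.

```python
import numpy as np

def spml_assume_negative_loss(logits, positive_idx, observed_mask=None):
    """Illustrative single-positive multi-label BCE.

    logits:        (N, C) raw scores over the union of all task labels.
    positive_idx:  (N,) index of the single known-positive label per sample.
    observed_mask: optional (N, C) array; 1 where a label is observed for the
                   sample's source task, 0 where it belongs to another task
                   and its loss term should be ignored.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-label sigmoid

    # One-hot target: the single annotated positive is 1, everything else
    # is assumed negative (the standard SPML baseline assumption).
    targets = np.zeros_like(logits)
    targets[np.arange(len(logits)), positive_idx] = 1.0

    eps = 1e-12  # numerical guard for log
    bce = -(targets * np.log(probs + eps)
            + (1.0 - targets) * np.log(1.0 - probs + eps))

    if observed_mask is not None:
        # Drop loss terms for labels that were never observable for this
        # sample (e.g., another task's classes after dataset fusion).
        bce = bce * observed_mask
        return bce.sum() / np.maximum(observed_mask.sum(), 1.0)
    return bce.mean()
```

In a CLIP-style setup, `logits` would be the similarities between an image embedding and the text embeddings of per-label prompts; the mask is what lets a frame annotated only for phase recognition still contribute gradient without penalizing predictions for unannotated safety labels.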