Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To reduce the reliance on costly spatial annotations for critical view of safety (CVS) recognition in laparoscopic cholecystectomy, this paper proposes CVS-AdaptNet, a weakly supervised, fine-grained multi-label classification method built on the multimodal surgical foundation model PeskaVLP. Instead of pixel- or region-level supervision, CVS-AdaptNet aligns image embeddings with textual descriptions of each clinical CVS criterion via positive and negative prompts, enabling text-guided inference over image-text alignment. The work adapts a multimodal foundation model to surgical multi-label recognition through dedicated multi-label adaptation strategies and prompt engineering. Evaluated on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mean average precision (mAP), outperforming a vision-only ResNet50 baseline (51.5 mAP) by 6 points, improving both the accuracy and the annotation efficiency of automated intraoperative safety assessment.

📝 Abstract
The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional approaches to CVS recognition depend on vision-only models trained with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help analyse the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
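The core idea in the abstract, scoring each CVS criterion independently by contrasting the image embedding against a positive and a negative text prompt, can be sketched as follows. This is a minimal illustration of the general positive/negative prompt-pair mechanism, not the paper's actual implementation: the function name, the use of plain NumPy vectors in place of real PeskaVLP image/text embeddings, and the temperature value are all assumptions for the sake of a runnable example.

```python
import numpy as np

def l2_normalize(x):
    """Scale a vector to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x)

def multilabel_prompt_scores(image_emb, pos_prompt_embs, neg_prompt_embs,
                             temperature=0.07):
    """For each criterion, compare the image embedding with a positive prompt
    ("criterion is satisfied") and a negative prompt ("criterion is not
    satisfied"); a softmax over the two cosine similarities gives an
    independent per-criterion probability, i.e. a multi-label prediction
    rather than a single multi-class softmax over all criteria."""
    img = l2_normalize(image_emb)
    probs = []
    for pos, neg in zip(pos_prompt_embs, neg_prompt_embs):
        sims = np.array([img @ l2_normalize(pos),
                         img @ l2_normalize(neg)]) / temperature
        sims -= sims.max()            # numerical stability before exp
        e = np.exp(sims)
        probs.append(float(e[0] / e.sum()))  # P(criterion satisfied)
    return np.array(probs)

# Toy embeddings for two criteria (real systems would encode text prompts
# and the surgical frame with the foundation model's encoders).
pos = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
neg = [np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])]
probs = multilabel_prompt_scores(np.array([1.0, 0.0, 0.0]), pos, neg)
# probs[0] is near 1 (image matches criterion 0's positive prompt);
# probs[1] is 0.5 (image is equidistant from both prompts).
```

Because each criterion gets its own two-way softmax, the probabilities do not need to sum to one across criteria, which is what distinguishes this multi-label setup from the multi-class adaptation used by many existing multi-modal models.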
Problem

Research questions and friction points this paper is trying to address.

Automating Critical View of Safety recognition using multi-modal models
Enhancing multi-label classification via image-text alignment
Reducing reliance on costly spatial annotations for CVS assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal adaptation for surgical CVS recognition
Aligns image embeddings with textual CVS descriptions
Enhances multi-label classification via positive-negative prompts