Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack systematic evaluation in laparoscopic surgical contexts, particularly regarding perceptual fidelity, frame-level scene understanding, and medical-knowledge-intensive reasoning. Method: The authors introduce the first large-scale, surgery-specific VLM benchmark, comprising multi-source laparoscopic videos with expert annotations, and evaluate medical-specialized VLMs (e.g., LLaVA-Med, Med-PaLM-V) against general-purpose VLMs (e.g., LLaVA-1.5) across three task categories. Contribution/Results: General-purpose VLMs approach their general-domain performance on basic perception tasks but degrade substantially on tasks requiring medical knowledge. Surprisingly, current medical-specialized VLMs underperform generalist models across all tasks, exposing critical gaps in surgical domain adaptation, robustness, and clinical reasoning. This work provides the first empirical evidence that general-purpose VLMs currently outperform medical-specialized ones in surgical evaluation and establishes a research direction for developing clinically reliable, surgery-aware VLMs.

📝 Abstract
While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' performance on basic endoscopic perception tasks
Evaluating VLMs' capability in advanced surgical scene understanding
Comparing specialized medical VLMs vs generalist models in surgery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Assessing VLMs on endoscopic surgical tasks
Comparing medical and generalist VLMs performance
Identifying gaps in surgical VLM optimization
Leon Mayer
PhD Student, German Cancer Research Center (DKFZ)
Tim Radsch
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany
Dominik Michael
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany
Lucas Luttner
PhD Student, German Cancer Research Center (DKFZ), Heidelberg University
Foundation Models · Vision-Language Models · Medical Image Computing · Surgical Data Science
Amine Yamlahi
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany
Evangelia Christodoulou
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany
Patrick Godau
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany; HIDSS4Health - Helmholtz Information and Data Science School for Health, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany
Marcel Knopp
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany
Annika Reinke
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany
Fiona Kolbinger
Weldon School of Biomedical Engineering, Purdue University, West Lafayette, IN, USA; Regenstrief Center for Healthcare Engineering (RCHE), Purdue University, West Lafayette, IN, USA; Department of Biostatistics and Health Data Science, Richard M. Fairbanks School of Public Health, Indiana University School of Medicine, Indianapolis, IN, USA; Department of Surgery, Indiana University School of Medicine, Indianapolis, IN, USA; Department of Visceral, Thoracic and Vascular Surgery, University Hospital and Facu
Lena Maier-Hein
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Germany; DKFZ Heidelberg, Helmholtz Imaging, Germany; Medical Faculty, Heidelberg University, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany