🤖 AI Summary
Existing vision language models (VLMs) lack systematic evaluation in laparoscopic surgical contexts, particularly regarding perceptual fidelity, frame-level scene understanding, and medical-knowledge-intensive reasoning.
Method: We introduce the first large-scale, surgery-specific VLM benchmark, comprising multi-source laparoscopic videos and expert annotations, and evaluate medical-specialized VLMs (e.g., LLaVA-Med, Med-PaLM-V) against general-purpose VLMs (e.g., LLaVA-1.5) across three task categories.
Contribution/Results: General-purpose VLMs perform basic surgical perception tasks at levels comparable to general-domain tasks but degrade substantially on medically grounded reasoning. Surprisingly, current medical-specialized VLMs underperform their general-purpose counterparts across all tasks, exposing critical gaps in surgical domain adaptation, robustness, and clinical reasoning. This work provides the first large-scale empirical evidence that general-purpose VLMs currently outperform medical-specialized models on surgical tasks and establishes a research direction for developing clinically reliable, surgery-aware VLMs.
📝 Abstract
While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.