AI Summary
Intraoperative verbal instructions are inherently ambiguous, which compromises the safety of human-robot collaboration. To address this, the authors propose a joint vision-language disambiguation method that (1) builds a knowledge base of surgical tool capabilities, (2) draws visual context from surgical video using a multimodal vision-language model, and (3) applies a two-level affordance-based reasoning process to interpret the instruction. A dual-set conformal prediction mechanism then attaches a statistically rigorous confidence measure to each robot decision, so that ambiguous, high-risk instructions can be identified and flagged rather than executed. On a curated dataset of ambiguous requests drawn from cholecystectomy videos, the framework achieves a general disambiguation rate of 60%. The core contributions are: (i) knowledge-guided vision-language reasoning for understanding surgical instructions; and (ii) a decision mechanism with statistical guarantees grounded in conformal prediction.
Abstract
Effective human-robot collaboration in surgery is hindered by the inherent ambiguity of verbal communication. This paper presents a framework for a robotic surgical assistant that interprets and disambiguates verbal instructions from a surgeon by grounding them in the visual context of the operating field. The system employs a two-level affordance-based reasoning process that first analyzes the surgical scene using a multimodal vision-language model and then reasons about the instruction using a knowledge base of tool capabilities. To ensure patient safety, a dual-set conformal prediction method provides a statistically rigorous confidence measure for robot decisions, allowing the robot to identify and flag ambiguous commands. We evaluated our framework on a curated dataset of ambiguous surgical requests from cholecystectomy videos, demonstrating a general disambiguation rate of 60% and presenting a method for safer human-robot interaction in the operating room.
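
To make the idea of flagging ambiguous commands more concrete, the following is a minimal sketch of standard split conformal prediction over candidate tools. It is not the paper's dual-set method, whose details are not given in the abstract. The tool labels, the probability dictionaries, the nonconformity score `1 - p(true label)`, and the function names `conformal_threshold`, `prediction_set`, and `decide` are all illustrative assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out examples (split conformal).

    cal_probs: list of dicts mapping a candidate label to a model probability.
    cal_labels: list of the true label for each calibration example.
    alpha: target miscoverage rate (assumed value, not taken from the paper).
    """
    # Nonconformity score: how little probability the model gave the truth.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the empirical quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every label.
        return 1.0
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every label whose nonconformity score is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


def decide(probs, threshold):
    """Act only when exactly one candidate remains; otherwise flag the command."""
    candidates = prediction_set(probs, threshold)
    if len(candidates) == 1:
        return ("execute", next(iter(candidates)))
    return ("ask_surgeon", sorted(candidates))


if __name__ == "__main__":
    # Hypothetical calibration data: model probabilities and true tools.
    cal_probs = [
        {"grasper": 0.9, "scissors": 0.1},
        {"grasper": 0.2, "scissors": 0.8},
        {"grasper": 0.7, "scissors": 0.3},
        {"grasper": 0.55, "scissors": 0.45},
    ]
    cal_labels = ["grasper", "scissors", "grasper", "scissors"]
    threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
    print(decide({"grasper": 0.95, "scissors": 0.05}, threshold))
    print(decide({"grasper": 0.5, "scissors": 0.5}, threshold))
```

In this sketch, a prediction set with exactly one element is treated as a confident decision, while a set that is empty or contains several tools is handed back to the surgeon for clarification. Under the usual exchangeability assumption, the prediction set contains the true label with probability at least `1 - alpha`.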