🤖 AI Summary
Current surgical AI research predominantly focuses on intraoperative process understanding, yet lacks a unified capability to predict both short-term (e.g., action triplets, events) and long-term (e.g., phase transitions, remaining time) future surgical events, while relying on coarse-grained supervision and exhibiting poor generalizability. To address this, we propose a State Change Learning framework that reformulates surgical future prediction as a state transition classification problem, enabling a unified multi-task model. Our approach innovatively integrates an Action Dynamics module and Sinkhorn-Knopp–based clustering for discriminative state representation, coupled with a teacher-student architecture, video state compression, self-supervised contrastive learning, and cross-step dynamic modeling. This enables fine-grained action understanding and robust, generalizable future prediction. Extensive experiments across four benchmark datasets and three surgical procedures demonstrate significant improvements over state-of-the-art methods; cross-procedure transfer evaluations further validate the framework’s robustness and clinical applicability.
📝 Abstract
Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks, enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision, while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limitations by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, which implements this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.