🤖 AI Summary
In transnasal transsphenoidal pituitary surgery, a limited field of view and a highly dynamic procedural workflow hinder existing vision-language models, particularly visual question answering (VQA) systems, from reliably anticipating future surgical events.
Method: This paper introduces a novel paradigm for prospective surgical reasoning. We first construct PitVQA-Anticipation, the first surgical VQA dataset explicitly designed for future-event anticipation. Second, we propose SurgAnt-ViVQA, a model featuring a GRU-driven temporal cross-attention mechanism to enable fine-grained vision–language alignment and inter-frame temporal modeling; it further incorporates gated visual context injection and parameter-efficient fine-tuning to adapt large language models.
Results: Experiments demonstrate significant improvements over image- and video-based baselines on both PitVQA-Anticipation and EndoVis. Ablation studies confirm the critical roles of temporal modeling and gated fusion. A frame-budget study reveals a trade-off: 8-frame sequences maximize output fluency, while 32-frame sequences improve numeric time estimation at a slight cost in BLEU.
📝 Abstract
Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and the workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. Evaluated on the PitVQA-Anticipation and EndoVis datasets, SurgAnt-ViVQA surpasses strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame-budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.
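The abstract's description of the GRU Gated Temporal Cross-Attention module (a bidirectional GRU over frame features, cross-attention from language tokens to the temporal stream, and a per-token adaptive gate) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: all class names, layer sizes, and the exact gating formula are guesses inferred from the prose.

```python
# Hypothetical sketch of a GRU-gated temporal cross-attention block.
# Dimensions, module names, and the gating function are assumptions
# based only on the abstract, not the paper's actual architecture.
import torch
import torch.nn as nn

class GRUGatedTemporalCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Bidirectional GRU encodes frame-to-frame dynamics; hidden size is
        # halved so the concatenated directions return to d_model.
        self.temporal_gru = nn.GRU(d_model, d_model // 2,
                                   batch_first=True, bidirectional=True)
        # Language tokens (queries) attend over temporally encoded frames.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        # Adaptive gate: a per-token scalar in (0, 1) deciding how much
        # visual context to inject into the language stream.
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, 1), nn.Sigmoid(),
        )

    def forward(self, text_tokens: torch.Tensor,
                frame_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L, d); frame_feats: (B, T, d)
        temporal, _ = self.temporal_gru(frame_feats)            # (B, T, d)
        visual_ctx, _ = self.cross_attn(text_tokens, temporal, temporal)
        g = self.gate(torch.cat([text_tokens, visual_ctx], dim=-1))  # (B, L, 1)
        # Gated residual injection of visual context at the token level.
        return text_tokens + g * visual_ctx

# Toy usage: 2 clips, 8 frames each, 5 language tokens.
tokens = torch.randn(2, 5, 256)
frames = torch.randn(2, 8, 256)
out = GRUGatedTemporalCrossAttention()(tokens, frames)
print(out.shape)  # torch.Size([2, 5, 256])
```

The gated residual keeps the language stream intact when the gate saturates near zero, which is one plausible way to inject visual context without destabilizing a pretrained, parameter-efficiently fine-tuned language backbone.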