SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surgical video question answering (VideoQA) methods rely on static image features, neglecting temporal dynamics, and suffer from a lack of temporally annotated datasets, which hinders modeling of instrument-tissue interactions and motion events. To address this, we propose SurgViVQA, a temporally grounded VideoQA framework tailored for surgical videos, and introduce REAL-Colon-VQA, a temporally grounded colonoscopic VideoQA dataset featuring motion-aware annotations and diagnostic attributes and supporting open-ended, non-template questions. Our method employs a masked video-text encoder to fuse multimodal temporal features and fine-tunes a large language model for answer generation, explicitly modeling dynamic surgical interactions. Extensive evaluation demonstrates that our approach achieves +11% and +9% improvements in keyword accuracy over PitVQA on REAL-Colon-VQA and EndoVis18-VQA, respectively, while exhibiting superior generalization and robustness to variations in question phrasing.

📝 Abstract
Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.
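The reported gains are measured in keyword accuracy. As a rough illustration only (the exact metric definition lives in the paper and linked repository; the tokenization and averaging below are assumptions), keyword accuracy can be sketched as the fraction of reference keywords that appear in the generated answer:

```python
def keyword_accuracy(predicted_answer: str, reference_keywords: list[str]) -> float:
    """Fraction of reference keywords found in the predicted answer.

    Illustrative sketch only -- the metric actually used by SurgViVQA may
    differ (e.g. stemming, multi-word keywords, per-question averaging).
    """
    answer_tokens = predicted_answer.lower().split()
    hits = sum(1 for kw in reference_keywords if kw.lower() in answer_tokens)
    return hits / len(reference_keywords) if reference_keywords else 0.0

# Two of the three reference keywords appear in the answer -> 2/3
score = keyword_accuracy("the polyp is resected with a snare",
                         ["polyp", "snare", "forceps"])
```

A corpus-level score would then average this quantity over all question-answer pairs.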
Problem

Research questions and friction points this paper is trying to address.

Extends visual reasoning from static images to dynamic surgical scenes
Captures temporal cues like motion and tool-tissue interactions
Improves model robustness and accuracy on surgical video question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a Masked Video-Text Encoder for temporal fusion of video and question features
Fine-tunes an LLM to decode dynamic surgical interactions into coherent answers
Outperforms image-based VQA baselines on surgical datasets
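The core idea behind the temporal fusion step is that per-frame features are aggregated over time while respecting a mask. A minimal, dependency-free sketch of mask-aware temporal pooling (function name, shapes, and the pooling choice are hypothetical; the actual model uses a learned masked video-text transformer encoder, see the linked repository):

```python
def masked_temporal_pool(frame_features, valid_mask):
    """Average per-frame feature vectors over time, skipping masked frames.

    frame_features: list of per-frame feature vectors (list[list[float]])
    valid_mask:     list of bools, True where the frame contributes

    Hypothetical helper illustrating mask-aware temporal aggregation; it is
    NOT the SurgViVQA encoder, which learns attention over masked tokens.
    """
    dim = len(frame_features[0])
    totals = [0.0] * dim
    count = 0
    for feats, keep in zip(frame_features, valid_mask):
        if keep:
            for i, value in enumerate(feats):
                totals[i] += value
            count += 1
    return [t / count for t in totals] if count else totals

# Third frame is masked out (e.g. padding), so only the first two are pooled
pooled = masked_temporal_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                              [True, True, False])  # -> [2.0, 3.0]
```

In the full model, the pooled (or attended) video representation is fused with the question embedding before the fine-tuned LLM generates the answer.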
Mauro Orazio Drago
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Luca Carlini
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Pelinsu Celebi Balyemez
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Dennis Pierantozzi
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Chiara Lena
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Cesare Hassan
IRCCS Humanitas Research Hospital, Italy.
Danail Stoyanov
Professor of Robot Vision, University College London
Surgical Vision, Surgical AI, Surgical Robotics, Computer Assisted Interventions, Surgical Data Science
E. Momi
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Sophia Bano
Assistant Professor in Robotics and AI, University College London
Computer Vision, Surgical Data Science, Surgical Robotics, Computer-assisted Intervention, Medical Imaging
Mobarak I. Hoque
UCL Hawkes Institute and Department of Computer Science, University College London, UK.