OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

📅 2026-06-11

🤖 AI Summary

This work addresses fine-grained, multi-role action understanding in operating room (OR) videos, where clutter, occlusions, and limited sensing make it hard to model the temporal structure needed for coherent action segmentation. The authors establish the first action-centric benchmark on a publicly available ego-exocentric OR dataset: they define a fine-grained, multi-role action taxonomy and distill dense action segments from ground-truth scene graph state changes. They then propose a vision-only temporal model that significantly outperforms graph-based methods when given all available egocentric video, along with a multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition.

📝 Abstract

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach models this environment with scene graphs, an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when explicit temporal modeling is added through Graph Neural Networks. We therefore introduce a vision-only temporal model that significantly outperforms graph-based methods when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.
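
The abstract does not spell out how dense action segments are distilled from scene graph state changes. As a rough illustration only, the sketch below assumes each frame's ground-truth scene graph is a set of (subject, predicate, object) triplets, and treats the appearance and disappearance of a triplet as the start and end of a segment. The function name `segments_from_scene_graphs`, the `Segment` type, and the role and predicate labels are all hypothetical and not taken from the paper or its dataset.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

# Hypothetical representation: one relation = (subject, predicate, object).
Triplet = Tuple[str, str, str]


class Segment(NamedTuple):
    start: int        # first frame index where the relation holds
    end: int          # last frame index where the relation holds (inclusive)
    triplet: Triplet  # the relation that defines the action


def segments_from_scene_graphs(frames: List[Set[Triplet]]) -> List[Segment]:
    """Turn frame-wise scene graphs into temporally extended segments.

    Assumption (not from the paper): a segment opens when a triplet first
    appears and closes on the last frame before it disappears.
    """
    open_since: Dict[Triplet, int] = {}  # triplet -> frame where it appeared
    segments: List[Segment] = []

    for t, graph in enumerate(frames):
        # State change: a relation appears, so a new segment opens.
        for triplet in graph:
            if triplet not in open_since:
                open_since[triplet] = t
        # State change: a relation disappears, so its segment closes at t - 1.
        for triplet in list(open_since):
            if triplet not in graph:
                segments.append(Segment(open_since.pop(triplet), t - 1, triplet))

    # Relations still active at the end of the video close on the last frame.
    last = len(frames) - 1
    for triplet, start in open_since.items():
        segments.append(Segment(start, last, triplet))

    return sorted(segments)


# Toy example with made-up labels.
frames = [
    {("nurse", "holding", "scalpel")},
    {("nurse", "holding", "scalpel"), ("surgeon", "cutting", "patient")},
    {("surgeon", "cutting", "patient")},
    set(),
    {("nurse", "holding", "scalpel")},
]
for seg in segments_from_scene_graphs(frames):
    print(seg.start, seg.end, seg.triplet)
```

In this toy setup, a relation that disappears and later reappears yields two separate segments. The actual benchmark may merge short gaps or map relations onto its action taxonomy differently.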

Problem

Research questions and friction points this paper is trying to address.

operating room

fine-grained action

multi-role video understanding

temporal modeling

scene graphs

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained action recognition

temporal modeling

multi-role video understanding