Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This overview introduces an evaluation task that tests whether AI systems can integrate continuous visual perception, temporal structure reconstruction, and clinical workflow knowledge for clinical skill assessment. Systems must reorder shuffled clinical keyframes into their correct temporal sequence and generate reasoning explanations that experts can verify. The benchmark comprises 200 test instances across three emergency medical procedures and is scored with three metrics: Task Accuracy, Pairwise Accuracy, and BERTScore. Analysis of 90 submissions from seven teams shows that current models still struggle to jointly use visual evidence, temporal logic, and domain-specific knowledge. The study formalizes this reasoning task for the first time and establishes a benchmark for multimodal understanding in clinical settings.

📝 Abstract

This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
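The two ordering metrics can be illustrated with a minimal sketch. The abstract does not give formal definitions, so the code below rests on assumptions: Task Accuracy is taken to be the fraction of instances whose predicted order matches the ground truth exactly, and Pairwise Accuracy the fraction of frame pairs whose relative order agrees with the ground truth, averaged over instances. The official scorer may differ (for example, it might count only adjacent pairs). The function names `exact_match`, `pairwise_accuracy`, and `score` are hypothetical, and BERTScore for rationales is omitted.

```python
from itertools import combinations
from typing import Dict, Sequence


def exact_match(pred: Sequence[str], gold: Sequence[str]) -> bool:
    """True only if the predicted frame order equals the gold order."""
    return list(pred) == list(gold)


def pairwise_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of frame pairs ordered the same way in pred as in gold.

    Assumes pred is a permutation of gold (same frame IDs, no duplicates).
    """
    position = {frame: i for i, frame in enumerate(pred)}
    # combinations() preserves input order, so a precedes b in gold.
    pairs = list(combinations(gold, 2))
    if not pairs:
        return 1.0  # a single-frame instance has nothing to misorder
    correct = sum(position[a] < position[b] for a, b in pairs)
    return correct / len(pairs)


def score(preds: Sequence[Sequence[str]],
          golds: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Average both metrics over all instances (assumed aggregation)."""
    n = len(golds)
    return {
        "task_accuracy": sum(exact_match(p, g) for p, g in zip(preds, golds)) / n,
        "pairwise_accuracy": sum(pairwise_accuracy(p, g) for p, g in zip(preds, golds)) / n,
    }


if __name__ == "__main__":
    golds = [["A", "B", "C", "D"], ["A", "B", "C", "D"]]
    preds = [["A", "B", "C", "D"], ["A", "C", "B", "D"]]  # second swaps B and C
    print(score(preds, golds))
```

Under these assumptions, a prediction that swaps two neighboring frames scores zero on exact match for that instance but still gets five of its six pairs right, which is why the two metrics can diverge sharply.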

Problem

Research questions and friction points this paper is trying to address.

continuous perception

procedural reasoning

clinical skill assessment

temporal ordering

clinical workflow

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous perception

procedural reasoning

clinical skill assessment