🤖 AI Summary
This study addresses the limited accuracy of fine-grained activity recognition in newborn resuscitation videos, which hinders documentation for quality improvement and adherence to clinical guidelines. The work proposes the first application of local vision-language models (VLMs) with Low-Rank Adaptation (LoRA) fine-tuning in this domain. It compares zero-shot VLM inference, VLMs fine-tuned with a classification head, and a LoRA fine-tuned variant against a supervised TimeSformer baseline. Evaluated on 13.26 hours of simulated resuscitation videos, small local VLMs prove prone to hallucination, but the LoRA fine-tuned VLM reaches an F1 score of 0.91, compared with 0.70 for TimeSformer.
📝 Abstract
Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but has also highlighted the challenges of recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
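
For readers unfamiliar with LoRA, the following is a minimal sketch of the general idea, not the authors' implementation. A frozen pretrained weight matrix W is left untouched and a trainable low-rank product B·A, scaled by alpha / r, is added to it. The abstract does not state the rank, scaling factor, or which layers are adapted, so every name and number below (`lora_forward`, `alpha`, the toy matrices) is an illustrative assumption.

```python
# Illustrative LoRA forward pass in plain Python (no external libraries).
# All names, shapes, and values are hypothetical; none come from the paper.

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), usually initialised to zeros
    """
    r = len(A)                          # rank = number of rows in A
    base = matvec(W, x)                 # output of the frozen layer
    update = matvec(B, matvec(A, x))    # low-rank correction
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen 2x3 weight
A = [[0.5, 0.5, 0.5]]                    # rank r = 1
B_zero = [[0.0], [0.0]]                  # zero init: adapter has no effect yet
B_trained = [[1.0], [-1.0]]              # stand-in for values after training
x = [1.0, 2.0, 3.0]

out_initial = lora_forward(W, A, B_zero, x, alpha=2.0)
out_adapted = lora_forward(W, A, B_trained, x, alpha=2.0)
```

With B initialised to zeros the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behaviour and only the small matrices A and B are updated.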