Zero-Shot Action Recognition in Surveillance Videos

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of label scarcity, viewpoint variability, and poor video quality in surveillance footage, this paper proposes the first LVLM (Large Vision-Language Model) transfer framework for zero-shot action recognition. It adapts VideoLLaMA2 to surveillance scenarios and introduces a Self-Reflective Sampling (Self-ReS) strategy to improve the quality of temporal token representations, integrating cross-modal alignment with key-frame token reweighting for better semantic grounding. On UCF-Crime, VideoLLaMA2 outperforms the strongest baseline by 20% in zero-shot accuracy, and Self-ReS further raises zero-shot accuracy to 44.6%. This work is the first to empirically validate the effectiveness of LVLMs for zero-shot action recognition in real-world surveillance videos, establishing a paradigm for zero- and few-shot video understanding without task-specific supervision.

📝 Abstract
The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems heavily rely on core computer vision models that require extensive fine-tuning, which is particularly difficult in surveillance settings due to limited datasets and difficult conditions (viewpoint variability, low video quality, etc.). In this work, we propose leveraging Large Vision-Language Models (LVLMs), known for their strong zero- and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.
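The abstract describes Self-ReS only as a token-level sampling method; implementation details are not given here. A minimal sketch of what token-level relevance sampling might look like, assuming (hypothetically) that visual tokens are scored by cosine similarity against a query embedding and only the top-scoring tokens are kept in temporal order:

```python
import numpy as np

def self_reflective_sample(frame_tokens, query_emb, keep_ratio=0.25):
    """Hypothetical sketch of token-level sampling (not the paper's code):
    keep the visual tokens most similar to a query embedding.

    frame_tokens: (T, D) array of visual token embeddings
    query_emb:    (D,) embedding used to score token relevance
    keep_ratio:   fraction of tokens to retain
    """
    # cosine similarity between each token and the query
    tok = frame_tokens / np.linalg.norm(frame_tokens, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = tok @ q

    # keep the top-k scoring tokens, preserving temporal order
    k = max(1, int(len(frame_tokens) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-k:])
    return frame_tokens[idx], idx
```

The function names, scoring rule, and `keep_ratio` parameter are illustrative assumptions; the paper's actual Self-ReS may derive relevance from the LVLM's own internal signals rather than an external query embedding.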
Problem

Research questions and friction points this paper is trying to address.

Addresses zero-shot action recognition in surveillance videos, where labeled training data is scarce.
Overcomes limitations of current AI surveillance systems, which require extensive fine-tuning under difficult conditions (viewpoint variability, low video quality).
Enhances video understanding by transferring Large Vision-Language Models to the surveillance domain.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Large Vision-Language Models (LVLMs) for zero-shot surveillance video understanding
Introduces Self-Reflective Sampling (Self-ReS), a token-level sampling method
Achieves 44.6% zero-shot action recognition accuracy on UCF-Crime