🤖 AI Summary
This work addresses the tension between the high-level semantics that video anomaly detection requires and the stringent resource constraints of edge devices. To this end, we propose MemoVAD, a collaborative edge-cloud framework in which a lightweight detector paired with a causal temporal encoder performs routine inference on the edge device, while a powerful cloud-based vision-language model is invoked only on demand, specifically when the system encounters segments that exhibit both high uncertainty and semantic novelty. Our approach introduces a subjective-logic-based, uncertainty-aware gating mechanism and a dynamic semantic memory module; the memory caches cloud-derived semantic knowledge for reuse, so the edge model improves progressively. Experiments on the UCF-Crime and XD-Violence datasets, run on real edge hardware, show that MemoVAD achieves state-of-the-art performance while substantially reducing communication overhead.
📝 Abstract
Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the need for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address this challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge, using a lightweight detector and a causal Temporal Context Encoder (TCE) that models temporal dependencies. We introduce an Uncertainty-Aware Gating (UAG) policy, grounded in Subjective Logic, that models perceived uncertainty and queries the cloud-based VLM only for clips that are both high-uncertainty and semantically novel. In addition, a Dynamic Semantic Memory (DSM) caches VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on the UCF-Crime and XD-Violence datasets, run on a real edge device, show that MemoVAD substantially reduces communication overhead while outperforming state-of-the-art methods.
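
The gating idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the standard Subjective Logic mapping from non-negative class evidence to belief and uncertainty (alpha_k = e_k + 1, u = K / sum(alpha)), cosine similarity as the novelty measure, and FIFO eviction for the memory. The names `subjective_opinion`, `SemanticMemory`, and `should_query_cloud`, as well as both thresholds, are hypothetical.

```python
import math


def subjective_opinion(evidence):
    """Map non-negative per-class evidence to (beliefs, uncertainty).

    Assumed standard Subjective Logic / Dirichlet mapping:
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S,
    so that sum(beliefs) + uncertainty == 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticMemory:
    """Hypothetical prototype cache; FIFO eviction is an assumption."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.prototypes = []  # list of (embedding, label) pairs

    def nearest(self, embedding):
        """Return (best similarity, label); (-1.0, None) when empty."""
        best_sim, best_label = -1.0, None
        for proto, label in self.prototypes:
            sim = cosine(embedding, proto)
            if sim > best_sim:
                best_sim, best_label = sim, label
        return best_sim, best_label

    def add(self, embedding, label):
        """Cache a cloud-verified prototype, evicting the oldest if full."""
        if len(self.prototypes) >= self.capacity:
            self.prototypes.pop(0)
        self.prototypes.append((list(embedding), label))


def should_query_cloud(evidence, embedding, memory,
                       u_thresh=0.5, sim_thresh=0.8):
    """Query the cloud only if the clip is uncertain AND novel.

    Both thresholds are illustrative values, not taken from the paper.
    """
    _, uncertainty = subjective_opinion(evidence)
    best_sim, _ = memory.nearest(embedding)
    return uncertainty > u_thresh and best_sim < sim_thresh
```

In this sketch, a clip with no evidence and an empty memory triggers a cloud query. Once the returned label is cached with `memory.add(...)`, a later clip with a similar embedding is served from the edge, which is how caching would reduce communication under these assumptions.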