AI Summary
To address the trade-off between model compactness and recognition performance in on-device real-time spatiotemporal action recognition, this paper proposes DVFL-Net, a lightweight video focal modulation network. Methodologically, it integrates a spatiotemporal focal modulation mechanism with a knowledge distillation framework, using Video-FocalNet Base as the teacher model and a forward KL-divergence loss to guide a nano-scale student network in learning discriminative spatiotemporal features. Evaluated on UCF101, HMDB51, SSv2, and Kinetics-400, DVFL-Net achieves state-of-the-art accuracy (e.g., 94.8% on UCF101) while operating at ultra-low computational cost (<1.5 GFLOPs) and memory footprint. It significantly outperforms mainstream transformer-based approaches under comparable computational budgets, establishing a new Pareto-optimal balance between accuracy and efficiency for edge deployment.
Abstract
The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatiotemporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatiotemporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSv2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
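To make the distillation objective concrete, the sketch below shows a forward KL-divergence loss of the kind described, computed from teacher and student class logits. This is a minimal illustration, not the paper's training code: the temperature value, the T² scaling, and the logit shapes are illustrative assumptions.

```python
# Hedged sketch of a forward KL(teacher || student) distillation loss,
# as commonly used in knowledge distillation. Hyperparameters here
# (temperature=4.0) are assumptions, not values from the paper.
import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 4.0) -> torch.Tensor:
    """Forward KL divergence between temperature-softened class distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)          # soft targets
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)  # student log-probs
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Toy usage: a batch of 2 clips with 101 action classes (UCF101-sized output).
student_logits = torch.randn(2, 101)
teacher_logits = torch.randn(2, 101)
loss = forward_kl_distillation_loss(student_logits, teacher_logits)
```

In practice this term is typically combined with a standard cross-entropy loss on ground-truth labels, weighted by a mixing coefficient.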