AI Summary
To address the trade-off between model compactness and recognition performance in on-device real-time spatiotemporal action recognition, this paper proposes DVFL-Net, a lightweight video focal modulation network. Methodologically, it integrates a spatiotemporal focal modulation mechanism with a knowledge distillation framework, using Video-FocalNet Base as the teacher model and a forward KL-divergence loss to guide a nano-scale student network in learning discriminative spatiotemporal features. Evaluated on UCF101, HMDB51, SSv2, and Kinetics-400, DVFL-Net achieves state-of-the-art accuracy (e.g., 94.8% on UCF101) while operating at ultra-low computational cost (<1.5 GFLOPs) and memory footprint. It significantly outperforms mainstream transformer-based approaches under comparable computational budgets, establishing a new Pareto-optimal balance between accuracy and efficiency for edge deployment.
Abstract
The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatiotemporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatiotemporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSv2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
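To make the distillation objective concrete, the sketch below shows a forward KL-divergence loss of the kind described, computed from teacher and student class logits. This is a minimal illustration, not the paper's training code: the temperature value, the T² scaling, and the logit shapes are illustrative assumptions.

```python
# Hedged sketch of a forward KL(teacher || student) distillation loss,
# as commonly used in knowledge distillation. Hyperparameters here
# (temperature=4.0) are assumptions, not values from the paper.
import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 4.0) -> torch.Tensor:
    """Forward KL divergence between temperature-softened class distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)          # soft targets
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)  # student log-probs
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Toy usage: a batch of 2 clips with 101 action classes (UCF101-sized output).
student_logits = torch.randn(2, 101)
teacher_logits = torch.randn(2, 101)
loss = forward_kl_distillation_loss(student_logits, teacher_logits)
```

In practice this term is typically combined with a standard cross-entropy loss on ground-truth labels, weighted by a mixing coefficient.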