MoCrop: Training-Free Motion-Guided Cropping for Efficient Video Action Recognition

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of exploiting motion information in video action recognition, this paper proposes MoCrop, a training-free, parameter-free, motion-aware adaptive cropping method. MoCrop directly leverages motion vectors from the H.264 compressed domain and produces a single clip-level spatial crop for I-frames through motion-density modeling, denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC). It is compatible with mainstream backbone networks and introduces no additional training or parameters. On UCF101, ResNet-50 augmented with MoCrop achieves a 3.5% Top-1 accuracy gain at comparable computational cost, or a 2.4% gain with 26.5% fewer FLOPs. CoViAR with MoCrop maintains 88.5% accuracy while reducing computation by 26.7%, markedly improving the accuracy–efficiency trade-off.

📝 Abstract
We introduce MoCrop, a motion-aware adaptive cropping module for efficient video action recognition in the compressed domain. MoCrop uses motion vectors that are available in H.264 video to locate motion-dense regions and produces a single clip-level crop that is applied to all I-frames at inference. The module is training free, adds no parameters, and can be plugged into diverse backbones. A lightweight pipeline that includes denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC) via a motion-density submatrix search yields robust crops with negligible overhead. On UCF101, MoCrop improves accuracy or reduces compute. With ResNet-50, it delivers +3.5% Top-1 accuracy at equal FLOPs (attention setting), or +2.4% Top-1 accuracy with 26.5% fewer FLOPs (efficiency setting). Applied to CoViAR, it reaches 89.2% Top-1 accuracy at the original cost and 88.5% Top-1 accuracy while reducing compute from 11.6 to 8.5 GFLOPs. Consistent gains on MobileNet-V3, EfficientNet-B1, and Swin-B indicate strong generality and make MoCrop practical for real-time deployment in the compressed domain. Our code and models are available at https://github.com/microa/MoCrop.
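The adaptive-cropping step described in the abstract can be illustrated with a toy version of the motion-density submatrix search: given a per-block motion-vector magnitude map, find the fixed-size window containing the most motion. The sketch below is a minimal, assumed implementation using a 2D prefix sum (integral image); the function name and array layout are illustrative and are not taken from the MoCrop codebase.

```python
import numpy as np

def motion_density_crop(mv_mag, crop_h, crop_w):
    """Illustrative sketch: find the (crop_h, crop_w) window that maximizes
    total motion-vector magnitude, using a 2D prefix sum (integral image)
    so each window sum costs O(1)."""
    H, W = mv_mag.shape
    # Integral image padded with a zero row/column to simplify window sums.
    ii = np.zeros((H + 1, W + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(mv_mag, axis=0), axis=1)

    best, best_yx = -1.0, (0, 0)
    for y in range(H - crop_h + 1):
        for x in range(W - crop_w + 1):
            # Sum of mv_mag[y:y+crop_h, x:x+crop_w] via the integral image.
            s = (ii[y + crop_h, x + crop_w] - ii[y, x + crop_w]
                 - ii[y + crop_h, x] + ii[y, x])
            if s > best:
                best, best_yx = s, (y, x)
    return best_yx  # top-left corner of the motion-densest window
```

In the actual pipeline this search would run once per clip on the (denoised, merged) motion-vector field, and the resulting crop would be applied to all I-frames.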
Problem

Research questions and friction points this paper is trying to address.

Developing training-free motion-aware cropping for efficient video recognition
Utilizing H.264 motion vectors to locate motion-dense regions automatically
Reducing computational costs while maintaining or improving action recognition accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion vectors locate motion-dense regions for cropping
Training-free module with denoising, sampling, and adaptive cropping
Plug-and-play design reduces computation while maintaining accuracy
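The sampling idea from the pipeline can likewise be sketched: rather than scanning every window position, evaluate a random subset of candidates and keep the densest. This is an illustrative Monte Carlo approximation under assumed inputs, not the paper's exact MCS procedure.

```python
import numpy as np

def mc_crop_search(mv_mag, crop_h, crop_w, n_samples=64, rng=None):
    """Illustrative sketch: approximate the motion-densest crop by sampling
    random window positions instead of an exhaustive scan."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = mv_mag.shape
    # Sample candidate top-left corners uniformly over valid positions.
    ys = rng.integers(0, H - crop_h + 1, size=n_samples)
    xs = rng.integers(0, W - crop_w + 1, size=n_samples)
    best, best_yx = -1.0, (0, 0)
    for y, x in zip(ys, xs):
        s = mv_mag[y:y + crop_h, x:x + crop_w].sum()
        if s > best:
            best, best_yx = float(s), (int(y), int(x))
    return best_yx
```

Sampling trades a small chance of missing the optimum for a large constant-factor speedup, which matters when the crop must be chosen with negligible overhead at inference time.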
Binhua Huang
School of Computer Science, University College Dublin
Wendong Yao
School of Computer Science, University College Dublin
Shaowu Chen
College of Electronics and Information Engineering, Shenzhen University
Guoxin Wang
School of Electrical and Electronic Engineering, University College Dublin
Qingyuan Wang
School of Electrical and Electronic Engineering, University College Dublin
Soumyabrata Dev
University College Dublin
environmental informatics · remote sensing · renewable · machine learning