🤖 AI Summary
In few-shot action recognition (FSAR), image-level features are vulnerable to background noise and tend to overlook foreground action instances. To address this, we propose an image-instance dual-granularity spatiotemporal joint attention mechanism—the first to jointly model image-level and instance-level spatiotemporal dependencies within a dual-stream Transformer architecture—enabling fine-grained temporal alignment across video clips and discriminative focus on foreground features. Our method integrates prototype contrastive learning, dynamic frame sampling, and cross-instance relational distillation to substantially improve few-shot generalization. Under standard few-shot protocols on UCF101 and HMDB51, our approach achieves 92.3% and 78.6% accuracy, respectively—surpassing prior state-of-the-art methods by 4.1% and 3.8%—while significantly reducing dependence on labeled samples.
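The summary does not spell out how the two granularities interact, so the following is only a minimal NumPy sketch of one plausible reading of "dual-granularity joint attention": an image stream self-attends over per-frame tokens, an instance stream self-attends over detected-instance tokens, and a cross-attention step lets image queries focus on instance (foreground) features. All function and variable names here are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def dual_granularity_attention(image_tokens, instance_tokens):
    """image_tokens: (T, d) per-frame features; instance_tokens: (N, d) per-instance features.
    Hypothetical fusion: two self-attention streams plus image-to-instance cross-attention."""
    img_out = attention(image_tokens, image_tokens, image_tokens)      # image-level stream
    inst_out = attention(instance_tokens, instance_tokens, instance_tokens)  # instance-level stream
    cross = attention(img_out, inst_out, inst_out)  # image queries attend to foreground instances
    return img_out + cross  # fused image-instance representation, shape (T, d)

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 64))     # 8 sampled frames, 64-d features
instances = rng.normal(size=(4, 64))  # 4 detected action instances
fused = dual_granularity_attention(frames, instances)
print(fused.shape)  # (8, 64)
```

In an actual dual-stream Transformer this would be one block among several, with learned projections for queries, keys, and values; the sketch keeps identity projections purely to show the information flow between granularities.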