🤖 AI Summary
In few-shot action recognition (FSAR), image-level features are vulnerable to background noise and tend to overlook foreground action instances. To address this, we propose an image-instance dual-granularity spatiotemporal joint attention mechanism—the first to jointly model image-level and instance-level spatiotemporal dependencies within a dual-stream Transformer architecture—enabling fine-grained temporal alignment across video clips and discriminative focus on foreground features. Our method integrates prototype contrastive learning, dynamic frame sampling, and cross-instance relational distillation to substantially improve few-shot generalization. Under standard few-shot protocols on UCF101 and HMDB51, our approach achieves 92.3% and 78.6% accuracy, respectively—surpassing prior state-of-the-art methods by 4.1% and 3.8%—while significantly reducing dependence on labeled samples.
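The summary does not spell out how the two granularities interact, so the following is only a minimal NumPy sketch of one plausible reading of "dual-granularity joint attention": an image stream self-attends over per-frame tokens, an instance stream self-attends over detected-instance tokens, and a cross-attention step lets image queries focus on instance (foreground) features. All function and variable names here are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def dual_granularity_attention(image_tokens, instance_tokens):
    """image_tokens: (T, d) per-frame features; instance_tokens: (N, d) per-instance features.
    Hypothetical fusion: two self-attention streams plus image-to-instance cross-attention."""
    img_out = attention(image_tokens, image_tokens, image_tokens)      # image-level stream
    inst_out = attention(instance_tokens, instance_tokens, instance_tokens)  # instance-level stream
    cross = attention(img_out, inst_out, inst_out)  # image queries attend to foreground instances
    return img_out + cross  # fused image-instance representation, shape (T, d)

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 64))     # 8 sampled frames, 64-d features
instances = rng.normal(size=(4, 64))  # 4 detected action instances
fused = dual_granularity_attention(frames, instances)
print(fused.shape)  # (8, 64)
```

In an actual dual-stream Transformer this would be one block among several, with learned projections for queries, keys, and values; the sketch keeps identity projections purely to show the information flow between granularities.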