Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two weaknesses of trajectory-based few-shot action recognition, namely choosing which points to track and modeling their motion, this paper proposes a semantic-aware trajectory modeling framework. The method combines point tracking, saliency estimation, Histogram of Oriented Displacements (HoD) encoding, and a relational token network. Its key contributions are: (1) a semantic-saliency-guided point sampling strategy that prioritizes informative, discriminative points to track; and (2) joint modeling of intra-trajectory motion (via HoD) and inter-trajectory relationships through learned relational tokens, which fuses motion with appearance features. Evaluated on six few-shot action recognition benchmarks, including Something-Something-V2, Kinetics, and UCF101, the method achieves state-of-the-art performance.

📝 Abstract
Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. Project page: https://trokens-iccv25.github.io
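To make the intra-trajectory descriptor more concrete, the following is a minimal sketch of a generic histogram of oriented displacements for one tracked point. It is an illustration under stated assumptions, not the paper's implementation: the function name `histogram_of_oriented_displacements`, the bin count, the magnitude weighting, and the L1 normalization are all choices made here and are not taken from the abstract.

```python
import numpy as np


def histogram_of_oriented_displacements(traj, num_bins=8):
    """Generic HoD sketch for a single trajectory (hypothetical helper).

    traj: array-like of shape (T, 2) holding (x, y) positions over T frames.
    Returns an array of shape (num_bins,) where each bin accumulates the
    magnitude of frame-to-frame displacements whose direction falls in it.
    """
    traj = np.asarray(traj, dtype=np.float64)

    # Frame-to-frame displacement vectors, shape (T-1, 2).
    disp = np.diff(traj, axis=0)
    mag = np.linalg.norm(disp, axis=1)

    # Direction of each displacement, mapped from [-pi, pi] to [0, 2*pi).
    ang = np.mod(np.arctan2(disp[:, 1], disp[:, 0]), 2 * np.pi)

    # Assign each direction to one of num_bins equal angular sectors.
    # The final modulo guards against an angle rounding up to 2*pi.
    bins = np.floor(ang / (2 * np.pi) * num_bins).astype(int) % num_bins

    # Magnitude-weighted vote per bin (assumption: larger motion counts more).
    hist = np.bincount(bins, weights=mag, minlength=num_bins)

    # L1-normalize; a stationary point keeps an all-zero histogram.
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Applied to every tracked point, a descriptor of this kind gives one fixed-length motion vector per trajectory, regardless of how far or how fast the point moved, which is the sort of input a token-based model can consume.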

Problem

Research questions and friction points this paper is trying to address.

Selecting informative points for tracking in videos
Modeling motion patterns of tracked points effectively
Enhancing appearance features with motion for action recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware sampling for adaptive point tracking
HoD and inter-trajectory motion modeling framework
Combining trajectory tokens with semantic features
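
The first item above, semantic-aware sampling, can be illustrated with a simplified sketch that draws tracking points with probability proportional to a saliency map. This is an assumption-laden stand-in rather than the authors' strategy: the abstract says points are distributed by object scale and semantic relevance, whereas this example only assumes a precomputed, non-negative per-pixel saliency map, and the name `sample_points_by_saliency` is hypothetical.

```python
import numpy as np


def sample_points_by_saliency(saliency, num_points, rng=None):
    """Sample pixel locations proportionally to saliency (hypothetical helper).

    saliency: array-like of shape (H, W) with non-negative scores.
    Returns an integer array of shape (num_points, 2) of (x, y) coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    saliency = np.asarray(saliency, dtype=np.float64)
    h, w = saliency.shape

    # Turn the map into a probability distribution over pixels.
    weights = saliency.ravel()
    total = weights.sum()
    if total > 0:
        probs = weights / total
    else:
        # Assumption: fall back to uniform sampling when nothing is salient.
        probs = np.full(h * w, 1.0 / (h * w))

    # Draw flat pixel indices, then convert them back to (row, col).
    flat_idx = rng.choice(h * w, size=num_points, replace=True, p=probs)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)
```

Compared with a uniform grid, a weighted draw like this concentrates the tracking budget on regions the saliency map marks as relevant, so small but important objects are less likely to be missed.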