TrAction: Action Recognition with Sparse Trajectories

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of action recognition models to exploit appearance and background shortcuts by proposing sparse point trajectories as an input modality that is largely free of such biases by construction. It shows that trajectory features are complementary to state-of-the-art appearance-based features. To exploit this modality, the authors introduce a simple Transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining strategy, which substantially improves downstream accuracy. Using only a fraction of the dense RGB input, the method reaches 45% top-1 accuracy on Something-Something V2 and 54% on EPIC-Kitchens-100, and it surpasses V-JEPA on time-reversal sensitivity. Fusing the pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively.

📝 Abstract

Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction
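
To make the input modality more concrete, the following is a minimal sketch of what sparse trajectory data, random masking, and time reversal could look like. It is not the authors' implementation. It assumes that "2.5D" means image coordinates plus depth, that each clip is a set of tracked points sampled at every frame, and that masking hides whole tracks; the abstract does not specify any of these details. All names (`make_toy_tracks`, `mask_tracks`, `reverse_time`, `MASK`) are hypothetical.

```python
import random

# Assumption: a clip is a list of tracks, and each track is a list of
# (x, y, depth) samples, one per frame. The paper's actual layout may differ.
MASK = None  # hypothetical placeholder standing in for a hidden track


def make_toy_tracks(num_points, num_frames, seed=0):
    """Create toy tracks that drift to the right at constant depth."""
    rng = random.Random(seed)
    tracks = []
    for _ in range(num_points):
        x0, y0, depth = rng.random(), rng.random(), rng.random()
        tracks.append([(x0 + 0.01 * t, y0, depth) for t in range(num_frames)])
    return tracks


def mask_tracks(tracks, mask_ratio, seed=0):
    """Hide a random subset of whole tracks (an assumed masking scheme).

    Returns the masked clip and the sorted indices of the hidden tracks,
    which a pretraining objective would then try to reconstruct.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * len(tracks))
    hidden = sorted(rng.sample(range(len(tracks)), num_masked))
    hidden_set = set(hidden)
    masked = [MASK if i in hidden_set else track
              for i, track in enumerate(tracks)]
    return masked, hidden


def reverse_time(tracks):
    """Play every track backwards, as in a time-reversal sensitivity probe."""
    return [track[::-1] for track in tracks]


if __name__ == "__main__":
    clip = make_toy_tracks(num_points=8, num_frames=4)
    masked_clip, hidden_idx = mask_tracks(clip, mask_ratio=0.5)
    print("hidden tracks:", hidden_idx)
    reversed_clip = reverse_time(clip)
    # Net horizontal motion changes sign when time is reversed.
    print(clip[0][-1][0] - clip[0][0][0])
    print(reversed_clip[0][-1][0] - reversed_clip[0][0][0])
```

Because a trajectory carries only positions over time, reversing it changes the direction of motion while leaving no appearance cue behind, which is one way to read why such an input could be sensitive to time reversal.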

Problem

Research questions and friction points this paper is trying to address.

action recognition

sparse trajectories

appearance bias

motion modeling

video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse trajectories

action recognition

masked pretraining