Multi-entity Video Transformers for Fine-Grained Video Representation Learning

📅 2023-11-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
For fine-grained temporal video representation learning—e.g., frame-level action phase classification and frame retrieval—this paper proposes the Multi-Entity Video Transformer (MV-Former). Methodologically, MV-Former departs from conventional late-fusion paradigms by (1) introducing intra-frame multi-entity tokenization, representing each frame as multiple spatially grounded tokens instead of a single global vector; (2) designing a learnable spatial token pooling mechanism that efficiently aggregates multi-region features without fine-tuning the ViT backbone; and (3) integrating self-supervised ViT image features, cross-frame multi-entity attention, and a lightweight temporal encoder for robust spatiotemporal joint modeling. Evaluated on multiple fine-grained video benchmarks, MV-Former achieves state-of-the-art performance, substantially outperforming existing self-supervised approaches and even surpassing certain supervised methods. Further pretraining on Kinetics-400 yields consistent gains, demonstrating strong generalization and scalability.
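The core architectural idea above, pooling many ViT patch features into a few "entity" tokens per frame via learnable queries, can be sketched as follows. This is a minimal illustration under assumed shapes and names (`spatial_token_pooling`, a 14x14 patch grid, 4 entities); it is not the paper's exact implementation, which also includes cross-frame attention and a temporal encoder.

```python
import numpy as np

def spatial_token_pooling(patch_feats, queries):
    """Hypothetical sketch of Learnable Spatial Token Pooling:
    aggregate N frozen ViT patch features into K 'entity' tokens
    using softmax attention weights, one weight map per query.

    patch_feats: (N, D) patch features for one frame (backbone frozen)
    queries:     (K, D) learnable pooling queries (one per entity)
    returns:     (K, D) entity tokens
    """
    scores = queries @ patch_feats.T                 # (K, N) query-patch affinities
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over the N patches
    return weights @ patch_feats                     # (K, D) weighted means

# Toy usage: 196 patches (14x14 ViT grid), 384-dim features, 4 entities
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 384))
q = rng.standard_normal((4, 384))
tokens = spatial_token_pooling(feats, q)
print(tokens.shape)  # (4, 384)
```

Because the queries are the only new parameters here, this kind of pooling can be trained while the ViT backbone stays frozen, which matches the paper's stated goal of avoiding backbone fine-tuning.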
📝 Abstract
The area of temporally fine-grained video representation learning aims to generate frame-by-frame representations for temporally dense tasks. In this work, we advance the state-of-the-art for this area by re-examining the design of transformer architectures for video representation learning. A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline by representing multiple entities per frame. Prior works use late fusion architectures that reduce frames to a single dimensional vector before any cross-frame information is shared, while our method represents each frame as a group of entities or tokens. Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks. MV-Former leverages image features from self-supervised ViTs, and employs several strategies to maximize the utility of the extracted features while also avoiding the need to fine-tune the complex ViT backbone. This includes a Learnable Spatial Token Pooling strategy, which is used to identify and extract features for multiple salient regions per frame. Our experiments show that MV-Former not only outperforms previous self-supervised methods, but also surpasses some prior works that use additional supervision or training data. When combined with additional pre-training data from Kinetics-400, MV-Former achieves a further performance boost. The code for MV-Former is available at https://github.com/facebookresearch/video_rep_learning.
Problem

Research questions and friction points this paper is trying to address.

Improving self-supervised video representation learning for fine-grained tasks
Enhancing cross-frame dynamics by processing multiple entities per frame
Achieving state-of-the-art performance on fine-grained video benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-entity tokens for cross-frame dynamics
Learnable Spatial Token Pooling strategy
State-of-the-art self-supervised video representation
Matthew Walmer
University of Maryland, College Park
Rose Kanjirathinkal
Meta
Kai Sheng Tai
Meta
Keyur Muzumdar
Meta
Taipeng Tian
Meta AI
Computer Vision · Autonomous Driving · Unsupervised Learning
Abhinav Shrivastava
Associate Professor, University of Maryland, College Park
Computer Vision · Machine Learning · Robotics