AI Summary
This work addresses the dual challenges of dynamic 3D scene modeling and unsupervised object tracking in self-supervised video representation learning. We propose Video-GMAE, a self-supervised framework based on a Gaussian Masked Autoencoder for videos. Its core innovation lies in representing a video as a set of 3D Gaussians that move over time and in modeling the trajectories of these Gaussian splats, a formulation that inherently couples 2D projection consistency with 3D motion priors. With this design, object tracking emerges from pretraining alone, without any explicit trajectory or mask supervision: projecting the learned Gaussian trajectories onto the image plane yields zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, Video-GMAE improves by 34.6% on Kinetics and 13.1% on Kubric, surpassing prior self-supervised video methods.
Abstract
We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, our models achieve a 34.6% improvement on Kinetics and a 13.1% improvement on Kubric, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
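To make the idea of mapping Gaussian trajectories onto the image plane concrete, here is a minimal NumPy sketch. It is not taken from the Video-GMAE code. It assumes the Gaussian centers are given in camera coordinates as an array of shape `(T, N, 3)`, that the camera is a simple pinhole model with a hypothetical `focal` length and `principal_point`, and that a query point is tracked by following the Gaussian nearest to it in the query frame. The function names `project_gaussian_tracks` and `query_track` are illustrative only.

```python
import numpy as np


def project_gaussian_tracks(centers, focal=1.0, principal_point=(0.0, 0.0)):
    """Project per-frame 3D Gaussian centers to 2D with a pinhole camera.

    centers: array of shape (T, N, 3) holding camera-space (x, y, z)
             for N Gaussians over T frames (assumed layout), with z > 0.
    Returns an array of shape (T, N, 2) of image-plane coordinates.
    """
    centers = np.asarray(centers, dtype=np.float64)
    if centers.ndim != 3 or centers.shape[-1] != 3:
        raise ValueError("centers must have shape (T, N, 3)")
    z = centers[..., 2:3]
    if np.any(z <= 0):
        raise ValueError("all centers must lie in front of the camera (z > 0)")
    # Perspective division followed by a shift to the principal point.
    return focal * centers[..., :2] / z + np.asarray(principal_point)


def query_track(tracks_2d, query_xy, query_frame=0):
    """Return the 2D trajectory of the Gaussian nearest to a query point.

    Nearest-neighbour association is an assumption of this sketch.
    """
    dists = np.linalg.norm(tracks_2d[query_frame] - np.asarray(query_xy), axis=-1)
    nearest = int(np.argmin(dists))
    return tracks_2d[:, nearest]


if __name__ == "__main__":
    # Two Gaussians observed over two frames (toy values).
    centers = np.array([
        [[0.0, 0.0, 1.0], [2.0, 0.0, 2.0]],
        [[1.0, 0.0, 1.0], [2.0, 2.0, 2.0]],
    ])
    tracks = project_gaussian_tracks(centers)
    print(query_track(tracks, query_xy=(0.9, 0.1)))
```

Because each Gaussian keeps its identity across frames, its projected center already forms a point track, which is why no explicit trajectory supervision is needed in this picture.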