Tracking by Predicting 3-D Gaussians Over Time

📅 2025-12-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses dynamic 3D scene modeling and object tracking within self-supervised video representation learning. It proposes Video-GMAE, a self-supervised framework based on a Gaussian Masked Autoencoder for videos. The core idea is to encode a video as a set of 3D Gaussian splats that move over time, which builds in the inductive bias that 2D videos are often consistent projections of a dynamic 3D scene. Tracking emerges from pretraining: projecting the trajectories of the learned Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, the models improve by 34.6% on Kinetics and 13.1% on Kubric, surpassing existing self-supervised video approaches.

📝 Abstract

We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
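
To make the "mapping onto the image plane" step concrete, here is a minimal sketch of how a 3D Gaussian center that moves over time could be turned into a 2D track. It is not the authors' implementation: the pinhole camera model, the intrinsics `focal`, `cx`, `cy`, and the function names `project_point` and `project_trajectory` are all assumptions made for illustration, and only the Gaussian centers are projected (covariances, opacity, and color are ignored).

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_point(center: Point3D, focal: float, cx: float, cy: float) -> Point2D:
    """Pinhole projection of a 3-D point (camera coordinates) to pixel coordinates.

    Assumption: a simple pinhole camera with a single focal length and
    principal point (cx, cy); the abstract does not specify a camera model.
    """
    x, y, z = center
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


def project_trajectory(
    centers_over_time: List[Point3D], focal: float, cx: float, cy: float
) -> List[Point2D]:
    """Project one Gaussian's center at every frame, giving a 2-D track."""
    return [project_point(c, focal, cx, cy) for c in centers_over_time]


# Hypothetical example: one Gaussian center observed over three frames.
centers = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (1.0, 0.5, 4.0)]
track = project_trajectory(centers, focal=100.0, cx=64.0, cy=64.0)
print(track)  # [(64.0, 64.0), (89.0, 64.0), (89.0, 76.5)]
```

In this sketch, a query point would be tracked by following the projected center of whichever Gaussian covers it in the first frame; how Video-GMAE actually associates query points with Gaussians is not stated in the abstract.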

Problem

Research questions and friction points this paper is trying to address.

Self-supervised video representation learning via 3-D Gaussians

Emergent tracking from pretraining with Gaussian splat sequences

Zero-shot and fine-tuned tracking performance on video datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised video representation learning using Gaussian splats

Tracking emerges by predicting 3D Gaussians over time

Zero-shot tracking by mapping Gaussian trajectories onto the image plane
