🤖 AI Summary
Existing video tokenization methods rely on fixed-grid patching, which over-encodes low-information regions and cannot distinguish static from dynamic content, resulting in high spatiotemporal redundancy and limited generalization. To address this, we propose the Gaussian Video Transformer (GVT): (1) replacing fixed-grid patches with generative 2D Gaussians for information-adaptive spatial rendering; (2) introducing Spatio-Temporal Gaussian Embedding (STGE) to generate these Gaussians from latent rigid features in a feed-forward manner; and (3) employing Gaussian Set Partitioning (GSP) to explicitly separate static and dynamic components. By unifying rasterization-based rendering with compact set-based modeling, GVT achieves state-of-the-art reconstruction quality on UCF101, Kinetics, and DAVIS, outperforms MAGVIT-v2 in action recognition, and delivers comparable compression performance. Overall, GVT significantly improves the compactness, adaptability, and generalization of spatiotemporal video representations.
📝 Abstract
Video tokenization is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid, patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video optimization. To enhance temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across time-steps and dynamic content specific to each time-step, enabling a compact representation. We primarily evaluate GVT on video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.
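To make the two core ideas concrete, here is a minimal, hypothetical numpy sketch (not the authors' implementation, which uses learned, feed-forward Gaussian prediction and differentiable rasterization): a set of 2D Gaussians is splatted into a feature map, so regions covered by denser Gaussians receive higher rendering weight, and the Gaussian set is partitioned into a static subset shared by all frames plus a dynamic subset per frame, mirroring GSP. All function and field names (`render_gaussians`, `render_clip`, `means`, `scales`, `feats`) are illustrative assumptions.

```python
import numpy as np

def render_gaussians(means, scales, feats, H, W):
    """Rasterize N 2D Gaussians (diagonal covariance, for simplicity)
    into an (H, W, C) feature map via density-weighted accumulation."""
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(float)      # (H, W, 2) pixel coords
    out = np.zeros((H, W, feats.shape[1]))
    wsum = np.zeros((H, W, 1))
    for mu, s, f in zip(means, scales, feats):
        d = (grid - mu) / s                               # normalized offsets
        w = np.exp(-0.5 * np.sum(d * d, axis=-1, keepdims=True))
        out += w * f                                      # weight feature by density
        wsum += w
    return out / np.maximum(wsum, 1e-8)                   # normalized rendering

def render_clip(static_g, dynamic_g_per_frame, H, W):
    """Gaussian Set Partitioning sketch: static Gaussians are reused for
    every time-step; dynamic Gaussians are specific to each frame."""
    frames = []
    for dyn in dynamic_g_per_frame:
        means = np.concatenate([static_g["means"], dyn["means"]])
        scales = np.concatenate([static_g["scales"], dyn["scales"]])
        feats = np.concatenate([static_g["feats"], dyn["feats"]])
        frames.append(render_gaussians(means, scales, feats, H, W))
    return np.stack(frames)                               # (T, H, W, C)
```

The compactness argument is visible in the parameter count: for T frames, the static content is stored once rather than T times, so only the (typically smaller) dynamic set scales with clip length.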