🤖 AI Summary
Implicit neural video representations (e.g., NeRV) suffer from slow encoding/decoding and high GPU memory consumption. To address these limitations, this work proposes an efficient video representation and compression framework based on deformable 2D Gaussian splatting. The key innovation is the first introduction of dynamic, time-varying 2D Gaussian deformation modeling, enabled by a multi-plane spatiotemporal encoder and a lightweight decoder. Crucially, temporal gradients drive the prediction of Gaussian parameters (position, shape, and color), explicitly exploiting inter-frame redundancy. Experiments demonstrate that the method reduces GPU memory usage by up to 78.4% and accelerates training and decoding by 5.5x and 12.5x, respectively, compared to NeRV, while maintaining competitive reconstruction quality. This yields a superior trade-off between fidelity and efficiency, significantly enhancing scalability and practical applicability for real-world video processing tasks.
📝 Abstract
Implicit Neural Representation for Videos (NeRV) has introduced a novel paradigm for video representation and compression, outperforming traditional codecs. As model size grows, however, slow encoding and decoding and high memory consumption hinder its practical application. To address these limitations, we propose a new video representation and compression method based on 2D Gaussian Splatting to handle video data efficiently. Our deformable 2D Gaussian Splatting dynamically adapts the transformation of the 2D Gaussians at each frame, significantly reducing memory cost. Equipped with a multi-plane spatiotemporal encoder and a lightweight decoder, it predicts the changes in color, coordinates, and shape of the initialized Gaussians at a given time step. By leveraging temporal gradients, our model captures temporal redundancy at negligible cost, significantly enhancing video representation efficiency. Our method reduces GPU memory usage by up to 78.4% and significantly expedites video processing, achieving 5.5x faster training and 12.5x faster decoding compared to state-of-the-art NeRV methods.
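To make the deformation pipeline concrete, here is a minimal numpy sketch of the general idea: three learned feature planes (xy, xt, yt) are sampled bilinearly at a Gaussian's position and time step, the features are fused, and a small decoder maps them to per-Gaussian deltas for position, shape, and color. All names, the plane resolution, the channel count, the multiplicative fusion, and the single linear decoder layer are assumptions for illustration; the paper's actual encoder/decoder architecture and parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at continuous coords u, v in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[y0, x0] + fx * (1 - fy) * plane[y0, x1]
            + (1 - fx) * fy * plane[y1, x0] + fx * fy * plane[y1, x1])

C = 8  # feature channels per plane (hypothetical)
# Multi-plane spatiotemporal encoder: one low-res feature grid per axis pair.
planes = {k: rng.normal(size=(16, 16, C)) for k in ("xy", "xt", "yt")}
# Lightweight "decoder" stand-in: a single linear map from fused features
# to 8 deltas (2 position, 3 shape, 3 color). A real model would train an MLP.
W_dec = rng.normal(size=(C, 8)) * 0.01

def deform(gaussian, t):
    """Predict a time-conditioned deformation of one base 2D Gaussian."""
    x, y = gaussian["pos"]
    # Fuse the three plane features multiplicatively (one common choice).
    feat = (bilinear(planes["xy"], x, y)
            * bilinear(planes["xt"], x, t)
            * bilinear(planes["yt"], y, t))
    d = feat @ W_dec
    return {"pos": gaussian["pos"] + d[:2],
            "shape": gaussian["shape"] + d[2:5],
            "color": gaussian["color"] + d[5:8]}

# One initialized Gaussian, deformed to time step t = 0.25.
g0 = {"pos": np.array([0.5, 0.5]), "shape": np.zeros(3),
      "color": np.array([0.2, 0.4, 0.6])}
g_t = deform(g0, t=0.25)
```

Because only the shared planes and the decoder vary with time, each frame is represented by small per-Gaussian deltas rather than a full per-frame model, which is where the memory savings over per-frame implicit representations come from.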