🤖 AI Summary
Existing implicit neural representations for video struggle to balance decoding speed, training efficiency, and reconstruction quality, and fail to disentangle camera motion from object motion. This paper proposes GSVR, a 2D Gaussian-based video representation framework that pioneers the integration of 2D Gaussians with a hybrid deformation field, explicitly separating camera and object motion via tri-plane feature encoding and polynomial motion modeling. To adapt to varying motion intensities, it introduces a dynamic-aware time slicing strategy; in addition, quantization-aware fine-tuning and dynamic GOP partitioning are designed to jointly improve compression ratio and reconstruction fidelity. On the Bunny dataset, GSVR trains in just 2 seconds per frame and decodes in real time at over 800 FPS (10× faster than SOTA) while exceeding 35 PSNR. On the UVG dataset, it outperforms NeRV in rate-distortion performance.
📝 Abstract
Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works focus on improving video reconstruction quality but pay little attention to decoding speed. The heavy computation of the convolutional networks used in these methods leads to slow decoding, and such convolution-based video representations also suffer from long training times, about 14 seconds per frame to reach 35+ PSNR on Bunny. To solve these problems, we propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny while requiring a training time of only 2 seconds per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, combining two motion patterns, tri-plane motion and polynomial motion, to handle the coupling of camera motion and object motion in the video. Furthermore, we propose a Dynamic-aware Time Slicing strategy that adaptively divides the video into multiple groups of pictures (GOPs) based on the dynamic level of the video, in order to handle large camera motion and non-rigid movement. Finally, we propose quantization-aware fine-tuning to avoid performance degradation after quantization, and utilize image codecs to compress the Gaussians into a compact representation. Experiments on the Bunny and UVG datasets confirm that our method converges much faster than existing methods and decodes 10× faster than other methods. Our method performs comparably to SOTA on the video interpolation task and attains better video compression performance than NeRV.
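To make the hybrid deformation field concrete, the sketch below shows one plausible reading of the abstract: each 2D Gaussian's center is displaced by a polynomial motion term shared across a GOP (smooth, camera-like motion) plus a per-position, per-time residual looked up from a learned grid (object motion, standing in for the tri-plane encoder). All function names, shapes, and the nearest-neighbor lookup are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def polynomial_motion(coeffs, t):
    """Hypothetical camera-motion term: displacement at normalized time t
    from polynomial coefficients. coeffs is a (K, 2) array; term k
    contributes coeffs[k] * t**(k+1), so displacement is zero at t=0."""
    powers = np.array([t ** (k + 1) for k in range(len(coeffs))])  # (K,)
    return powers @ coeffs  # (2,)

def grid_residual(grid, xy, t):
    """Stand-in for the tri-plane lookup: nearest-neighbor fetch of a
    2D displacement from a small (H, W, T, 2) feature grid indexed by
    normalized position xy and time t."""
    H, W, T, _ = grid.shape
    i = int(np.clip(xy[1] * (H - 1), 0, H - 1))
    j = int(np.clip(xy[0] * (W - 1), 0, W - 1))
    k = int(np.clip(t * (T - 1), 0, T - 1))
    return grid[i, j, k]

def deform(xy, t, coeffs, grid):
    """Deformed Gaussian center = base position + camera term + object term."""
    return xy + polynomial_motion(coeffs, t) + grid_residual(grid, xy, t)

rng = np.random.default_rng(0)
coeffs = np.array([[0.1, 0.0], [0.0, 0.05]])     # linear + quadratic terms
grid = rng.normal(scale=0.01, size=(8, 8, 4, 2))  # toy (x, y, t) residual grid
xy = np.array([0.5, 0.5])                         # one Gaussian's base center
print(deform(xy, 0.5, coeffs, grid))
```

In a real system these coefficients and grid features would be optimized jointly with the Gaussian parameters per GOP, and the lookup would use interpolated tri-plane features rather than a dense nearest-neighbor grid.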