GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field

📅 2025-07-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video implicit neural representations struggle to balance decoding speed, training efficiency, and reconstruction quality, and fail to disentangle camera motion from object motion. This paper proposes GSVR, a 2D Gaussian-based video representation framework that pairs 2D Gaussians with a hybrid deformation field: tri-plane feature encoding and polynomial motion modeling explicitly separate object motion from camera motion. To adapt to varying motion intensity, a Dynamic-aware Time Slicing strategy adaptively partitions the video into groups of pictures (GOPs); quantization-aware fine-tuning and image-codec-based Gaussian compression further improve the rate-distortion trade-off. On the Bunny dataset, GSVR trains in just 2 seconds per frame, decodes in real time at 800+ FPS (roughly 10x faster than prior methods), and reaches 35+ dB PSNR. On the UVG dataset, it outperforms NeRV in rate-distortion performance.

📝 Abstract
Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works pay much attention to improving video reconstruction quality but little to decoding speed; the heavy computation of the convolutional networks they rely on leads to slow decoding. Moreover, these convolution-based video representation methods also suffer from long training times, about 14 seconds per frame to achieve 35+ PSNR on Bunny. To solve the above problems, we propose GSVR, a novel 2D Gaussian-based video representation that achieves 800+ FPS and 35+ PSNR on Bunny while needing a training time of only 2 seconds per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, which combines two motion patterns, namely tri-plane motion and polynomial motion, to deal with the coupling of camera motion and object motion in the video. Furthermore, we propose a Dynamic-aware Time Slicing strategy to adaptively divide the video into multiple groups of pictures (GOPs) based on the dynamic level of the video, in order to handle large camera motion and non-rigid movements. Finally, we propose quantization-aware fine-tuning to avoid performance degradation after quantization, and we utilize image codecs to compress the Gaussians into a compact representation. Experiments on the Bunny and UVG datasets confirm that our method converges much faster than existing methods and decodes about 10x faster. Our method is comparable to SOTA on the video interpolation task and attains better video compression performance than NeRV.
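The abstract's hybrid deformation field combines a polynomial motion term with tri-plane-encoded motion to displace the 2D Gaussians over time. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of that idea: the plane resolution, feature size, nearest-neighbour lookup, and linear decoding head are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R, F = 16, 8  # plane resolution and feature channels (illustrative sizes)

# Three feature planes spanning (x, y), (x, t), (y, t) -- the tri-plane.
planes = {k: rng.normal(size=(R, R, F)) for k in ("xy", "xt", "yt")}
# Tiny linear head mapping summed features to a 2D displacement
# (a stand-in for whatever decoder the paper actually uses).
W = rng.normal(scale=0.01, size=(2, F))

def triplane_feat(x, y, t):
    """Nearest-neighbour lookup on the three planes; features are summed."""
    xi, yi, ti = (int(np.clip(v, 0.0, 1.0) * (R - 1)) for v in (x, y, t))
    return planes["xy"][xi, yi] + planes["xt"][xi, ti] + planes["yt"][yi, ti]

def deform(center, poly, t):
    """Hybrid deformation of one 2D Gaussian centre at time t:
    a polynomial term (smooth, camera-like motion) plus a
    tri-plane residual (local object motion)."""
    poly_disp = sum(c * t**k for k, c in enumerate(poly))  # sum_k c_k * t^k
    local_disp = W @ triplane_feat(center[0], center[1], t)
    return center + poly_disp + local_disp

center = np.array([0.5, 0.5])
poly = [np.zeros(2), np.array([0.1, 0.0])]  # constant + linear drift in x
print(deform(center, poly, 0.5))
```

In a real model the planes, polynomial coefficients, and head would be optimized per GOP against the reconstruction loss; the point of the split is that the low-order polynomial absorbs smooth global (camera) motion while the tri-plane residual captures local object motion.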
Problem

Research questions and friction points this paper is trying to address.

Convolution-based video INRs decode slowly due to heavy network computation
Existing methods need long training, about 14 seconds per frame for 35+ PSNR on Bunny
Camera motion and object motion are coupled and hard to disentangle
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid deformation field combines tri-plane and polynomial motion
Dynamic-aware Time Slicing adaptively divides the video into GOPs
Quantization-aware fine-tuning maintains performance post-compression
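The Dynamic-aware Time Slicing bullet above can be illustrated with a toy partitioner. The paper's exact dynamic-level measure and cutting rule are not specified here, so this sketch assumes a simple proxy (mean absolute frame difference) and a greedy motion budget; both are hypothetical stand-ins.

```python
import numpy as np

def dynamic_aware_slicing(frames, budget=4.0):
    """Greedy GOP partitioning: accumulate a per-frame dynamic score
    (mean absolute frame difference here, purely illustrative) and cut
    a new GOP whenever the accumulated motion exceeds `budget`.
    High-motion segments thus get shorter GOPs, static ones longer."""
    gops, start, acc = [], 0, 0.0
    for i in range(1, len(frames)):
        acc += np.abs(frames[i] - frames[i - 1]).mean()
        if acc > budget:
            gops.append((start, i))  # half-open frame range [start, i)
            start, acc = i, 0.0
    gops.append((start, len(frames)))
    return gops

# Static frames followed by a sudden high-motion segment.
static = [np.zeros((4, 4)) for _ in range(5)]
moving = [np.full((4, 4), float(i)) for i in range(5)]
print(dynamic_aware_slicing(static + moving, budget=1.5))
# -> [(0, 7), (7, 9), (9, 10)]
```

The static prefix is absorbed into one long GOP, while the moving tail is split into short ones, which is the behaviour the adaptive slicing strategy is after.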
Zhizhuo Pang
College of Intelligence and Computing, Tianjin University
Zhihui Ke
College of Intelligence and Computing, Tianjin University
Xiaobo Zhou
College of Intelligence and Computing, Tianjin University
Tie Qiu
Tianjin University
Industrial Internet of Things · Big Data · Intelligent Networking