AI Summary
Video frame prediction models complex motion poorly, mainly because fixed convolutional kernels have limited capacity to represent diverse motion patterns. To address this, the authors propose a flow-guided multi-scale deformable feature sampling mechanism that couples optical flow estimation with deformable convolution, so that spatial sampling adapts to motion and captures both global scene transformations and local object motion. The mechanism is embedded in a multi-scale spatiotemporal prediction network that maintains competitive inference speed. Evaluated on eight MPEG test sequences, the approach achieves state-of-the-art performance, outperforming prior methods by 1 dB in PSNR.
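To put the reported 1 dB gain in context, the following is a minimal sketch of the standard PSNR definition, PSNR = 10 log10(peak^2 / MSE). The 8-bit peak value of 255 and the helper name `mse_ratio_for_gain` are assumptions made for illustration; the summary does not say which channels or bit depth the evaluation uses.

```python
import math


def psnr(reference, predicted, peak=255.0):
    """PSNR in dB between two equally sized frames given as lists of pixel rows.

    peak=255.0 assumes 8-bit samples (an assumption, not stated in the source).
    """
    ref = [p for row in reference for p in row]
    pred = [p for row in predicted for p in row]
    if not ref or len(ref) != len(pred):
        raise ValueError("frames must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, pred)) / len(ref)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE changes for a given PSNR gain in dB (hypothetical helper)."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, a 1 dB PSNR gain corresponds to multiplying the mean squared error by `mse_ratio_for_gain(1.0)`, which is about 0.79, that is, roughly a 21% reduction in MSE.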
Abstract
Video frame prediction remains a fundamental challenge in computer vision with direct implications for autonomous systems, video compression, and media synthesis. We present FG-DFPN, a novel architecture that combines optical flow estimation with deformable convolutions to model complex spatio-temporal dynamics. By guiding deformable sampling with motion cues, our approach addresses the limitations of fixed-kernel networks when handling diverse motion patterns. The multi-scale design enables FG-DFPN to capture global scene transformations and local object movements simultaneously. Our experiments demonstrate that FG-DFPN achieves state-of-the-art performance on eight diverse MPEG test sequences, outperforming existing methods by 1 dB in PSNR while maintaining competitive inference speeds. The integration of motion cues with adaptive geometric transformations makes FG-DFPN a promising solution for next-generation video processing systems that require high-fidelity temporal predictions. The model and instructions to reproduce our results will be released at: https://github.com/KUIS-AI-Tekalp-Research Group/frame-prediction
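The idea of guiding deformable sampling with motion cues can be illustrated with a small sketch. This is not the FG-DFPN implementation: the abstract does not specify how flow and offsets are combined, so the code below assumes the simplest option, where each 3x3 kernel tap is sampled at its regular grid position shifted by the per-pixel optical flow plus a residual offset. The function names, the single-channel list-of-rows layout, and the zero padding are all hypothetical choices made for readability; a practical version would use a tensor library.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x); outside is zero."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                value += wy * wx * feat[yi][xi]
    return value


def flow_guided_deform_sample(feat, flow, residual, weights):
    """Deformable 3x3 sampling whose base offset at each pixel is the optical flow.

    feat:     H x W feature map
    flow:     H x W grid of (dy, dx) flow vectors (assumed to point at the source location)
    residual: H x W x 9 grid of (dy, dx) residual offsets, one per kernel tap
    weights:  3 x 3 kernel weights
    """
    h, w = len(feat), len(feat[0])
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fdy, fdx = flow[y][x]
            acc = 0.0
            for k, (ky, kx) in enumerate(taps):
                rdy, rdx = residual[y][x][k]
                # Regular grid position + flow (coarse motion) + residual (local refinement).
                acc += weights[ky + 1][kx + 1] * bilinear(
                    feat, y + ky + fdy + rdy, x + kx + fdx + rdx
                )
            out[y][x] = acc
    return out


def zero_residual(h, w):
    """Residual offsets of zero for every pixel and tap (hypothetical helper)."""
    return [[[(0.0, 0.0)] * 9 for _ in range(w)] for _ in range(h)]
```

With zero flow and zero residual offsets this reduces to an ordinary 3x3 convolution with zero padding; with an identity kernel (centre weight 1) and zero residuals it reduces to plain flow-based backward warping. A multi-scale variant would presumably apply the same sampling at several feature resolutions, which this sketch does not attempt.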