🤖 AI Summary
To address the blur and temporal inconsistency caused by large displacements in video frame interpolation under highly dynamic scenes, this paper proposes an enhanced diffusion-based framework, EDEN. Methodologically, it introduces three key components: (1) a transformer-based tokenizer that produces refined latent representations of the intermediate frames for the diffusion model; (2) a start-end frame difference embedding that explicitly conditions generation on the motion between the first and last frames; and (3) temporal attention throughout the diffusion transformer to strengthen long-range temporal consistency. Quantitative evaluation demonstrates state-of-the-art performance: LPIPS improves by nearly 10% on DAVIS and SNU-FILM, and results improve by 8% on the DAIN-HD benchmark. Qualitatively, the method markedly enhances structural fidelity and motion coherence, particularly in challenging large-motion scenarios.
📝 Abstract
Handling complex or nonlinear motion patterns has long posed a challenge for video frame interpolation. Although recent diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first uses a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention throughout the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
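The two conditioning ideas in the abstract can be illustrated concretely. The following is a minimal sketch, not the authors' implementation: all module names, shapes, and design choices (patch size, how the difference embedding is injected) are assumptions, shown only to make "start-end frame difference embedding" and "temporal attention" tangible.

```python
# Hypothetical sketch of EDEN-style conditioning (NOT the paper's code).
# Assumptions: a conv patchifier for the frame-difference embedding, and a
# mean-pooled injection of that embedding before temporal self-attention.
import torch
import torch.nn as nn

class FrameDiffEmbed(nn.Module):
    """Embed the pixel-wise difference between start and end frames."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=4, stride=4),  # patchify 4x4
            nn.Flatten(2),                                   # (B, dim, N)
        )

    def forward(self, start, end):
        # Difference image carries a coarse motion prior between endpoints.
        return self.proj(end - start).transpose(1, 2)        # (B, N, dim)

class TemporalBlock(nn.Module):
    """Self-attention over the temporal token axis, conditioned on motion."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, cond):
        # tokens: (B, T, dim) latent tokens per time step; cond: (B, N, dim).
        h = self.norm(tokens + cond.mean(dim=1, keepdim=True))  # inject cond
        out, _ = self.attn(h, h, h)                             # attend over T
        return tokens + out                                      # residual

B, T, dim = 2, 5, 64
start = torch.randn(B, 3, 32, 32)
end = torch.randn(B, 3, 32, 32)
tokens = torch.randn(B, T, dim)

cond = FrameDiffEmbed(dim=dim)(start, end)   # (2, 64, 64): 8x8 patches
out = TemporalBlock(dim=dim)(tokens, cond)   # (2, 5, 64)
print(out.shape)
```

In this toy setup, the difference embedding is pooled to a single vector and added to every temporal token; a real model would likely use a richer injection (e.g. cross-attention), which the abstract does not specify.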