TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video frame interpolation (VFI) methods face two key limitations: image diffusion models lack explicit temporal modeling capability, while video diffusion models suffer from prohibitive training and inference overhead. To address these challenges, this paper proposes an efficient video diffusion framework. Our method introduces: (1) a 3D wavelet-gated, time-aware autoencoder that jointly extracts compact yet rich spatiotemporal features; and (2) a latent-space Brownian bridge diffusion mechanism incorporating time-aware priors and optical flow guidance—significantly reducing data and parameter dependencies. Evaluated on the most challenging benchmark datasets, our approach achieves a 20% improvement in FID, reduces model parameters by over 20×, accelerates inference by 2.3×, and decreases training data requirements by 9,000×—thereby unifying high-quality synthesis with computational efficiency.
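To make the 3D-wavelet idea concrete, here is a minimal sketch of a one-level Haar wavelet transform along the temporal axis, the kind of decomposition a 3D-wavelet gate can use to separate shared content (low-pass) from motion (high-pass). The function name and the plain averaging/differencing form are illustrative assumptions; the paper's actual gating module is learned and operates on latent features.

```python
import numpy as np

def haar_temporal(video):
    """One-level Haar wavelet transform along the time axis.

    video: array of shape (T, H, W) with even T.
    Returns (low, high), each of shape (T // 2, H, W):
    low  = temporal averages of adjacent frame pairs (shared content),
    high = temporal differences of adjacent frame pairs (motion/change).
    """
    a, b = video[0::2], video[1::2]
    low = (a + b) / np.sqrt(2.0)   # low-pass band
    high = (a - b) / np.sqrt(2.0)  # high-pass band
    return low, high
```

The transform is invertible (frames are recovered as `(low ± high) / sqrt(2)`), so gating the two bands differently loses no information while letting the network emphasize temporal detail cheaply.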

📝 Abstract
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use $n$ to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over the recent SOTA among image-based diffusion models. Meanwhile, thanks to this rich temporal information, our method achieves strong performance with 3× fewer parameters. Such a parameter reduction results in a 2.3× speed-up. By incorporating optical flow guidance, our method requires 9,000× less training data and has over 20× fewer parameters than video-based diffusion models. Codes and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.
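The Brownian bridge formulation can be sketched as follows: instead of diffusing toward pure noise, the forward process interpolates between two fixed endpoints in latent space, so the intermediate-frame latent is pinned to its conditioning at both ends. This is a minimal sketch following the standard Brownian-bridge schedule ($m_t = t/T$, variance $2s\,m_t(1 - m_t)$); the variance scale $s$ and the choice of endpoints are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def brownian_bridge_sample(x0, y, t, T=1000, s=1.0, seed=None):
    """Sample x_t from a Brownian bridge between latents x0 and y.

    m_t = t / T
    mean     = (1 - m_t) * x0 + m_t * y
    variance = 2 * s * m_t * (1 - m_t)

    The variance vanishes at t = 0 and t = T, so the process is
    exactly x0 at the start and exactly y at the end.
    """
    rng = np.random.default_rng(seed)
    m_t = t / T
    delta_t = 2.0 * s * m_t * (1.0 - m_t)
    eps = rng.standard_normal(x0.shape)
    return (1.0 - m_t) * x0 + m_t * y + np.sqrt(delta_t) * eps
```

Because both endpoints are known latents rather than Gaussian noise, the reverse process starts much closer to the target, which is one reason bridge diffusion needs fewer sampling steps than noise-to-data diffusion.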
Problem

Research questions and friction points this paper is trying to address.

Efficient video frame interpolation using temporal-aware diffusion
Reducing model size and training data requirements
Improving performance with fewer parameters and faster speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-aware autoencoder extracts video temporal information
3D-wavelet gating enhances efficiency and performance
Optical flow guidance reduces training data needs
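The optical-flow-guidance idea in the last bullet rests on backward warping: a neighboring frame is resampled along a dense flow field to give the model a strong motion prior, so the diffusion process only has to refine residual detail rather than learn motion from scratch. Below is a minimal nearest-neighbor warping sketch; the function name, the `(dy, dx)` flow convention, and the clamping at image borders are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def backward_warp(frame, flow):
    """Backward-warp `frame` with a dense flow field (nearest-neighbor).

    frame: (H, W) image; flow: (H, W, 2) per-pixel displacements (dy, dx).
    Each output pixel p is read from frame[p + flow[p]], with source
    coordinates clamped to the image borders.
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return frame[src_y, src_x]
```

With such a warped estimate as a prior, the generative model corrects occlusions and blur instead of synthesizing the whole frame, which is what lets flow guidance cut training-data requirements so sharply.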