🤖 AI Summary
To address temporal incoherence, unnatural motion, and frame repetition and jitter in long-video generation, this paper proposes a hierarchical motion-guided framework. First, a Large Motion Text-to-Video Diffusion Model (LMTV-DM) generates keyframes separated by large motion intervals, ensuring content diversity. Then, a Latent Optical Flow Diffusion Model (LOF-DM), jointly trained with MotionControlNet, explicitly models and synthesizes dense optical flow fields between each keyframe pair; subsequent flow-driven warping and fine-grained refinement achieve 15× high-fidelity frame interpolation. By introducing explicit motion guidance and decomposing interpolation into optical flow synthesis and post-hoc refinement, the method significantly improves motion continuity and appearance consistency, outperforms state-of-the-art approaches across multiple benchmarks, and enables minute-long, high-quality video generation. Code and models will be publicly released.
📝 Abstract
Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite recent advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches typically synthesize long videos either by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both strategies still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that explicitly incorporates motion guidance. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15× interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: https://jiahaochen1.github.io/LumosFlow/
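To make the hierarchical pipeline concrete, the sketch below outlines the generation loop described in the abstract: keyframe generation, per-pair optical flow synthesis, flow-driven warping, and refinement. The model interfaces (`lmtv_dm`, `lof_dm`, `motion_controlnet`) and their signatures are hypothetical placeholders, not the released API; only the backward warping step is spelled out concretely, using standard PyTorch grid sampling.

```python
# Minimal sketch of the hierarchical, motion-guided pipeline under assumed
# model interfaces. The three model callables are placeholders; actual
# signatures in the released code may differ.
import torch
import torch.nn.functional as F


def warp_with_flow(frame, flow):
    """Warp a frame (B, C, H, W) by sampling it at positions displaced
    by a dense flow field (B, 2, H, W), via bilinear grid sampling."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base_grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(frame)  # (1, 2, H, W)
    coords = base_grid + flow
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)


def generate_long_video(prompt, lmtv_dm, lof_dm, motion_controlnet,
                        num_keyframes=8, n_interp=15):
    """Hierarchical generation: keyframes first, then 15x interpolation per pair."""
    # 1. Keyframes with large motion intervals from the text-to-video model
    #    (assumed to return a list of (1, C, H, W) tensors).
    keyframes = lmtv_dm(prompt, num_frames=num_keyframes)
    video = [keyframes[0]]
    for k0, k1 in zip(keyframes[:-1], keyframes[1:]):
        # 2. Dense optical flows toward each intermediate time step,
        #    assumed shape (n_interp, 2, H, W).
        flows = lof_dm(k0, k1, num_steps=n_interp)
        for t in range(n_interp):
            # 3. Flow-driven warping of the keyframe toward time step t.
            warped = warp_with_flow(k0, flows[t:t + 1])
            # 4. Refinement of the warped result into the final intermediate frame.
            video.append(motion_controlnet(warped, k0, k1))
        video.append(k1)
    return torch.cat(video, dim=0)  # (T, C, H, W)
```

With `n_interp=15`, each keyframe interval is filled with 15 generated intermediate frames, which is how the 15× interpolation ratio in the abstract is interpreted in this sketch.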