🤖 AI Summary
To address the challenge of jointly preserving visual fidelity and motion continuity in long-video generation, this paper proposes a hierarchical frame-rate prediction framework: it first generates a low-frame-rate video that captures the global spatiotemporal structure, then progressively inserts intermediate frames to increase temporal density and refine visual detail. Methodologically, the authors introduce a cross-frame-rate autoregressive mechanism and intra-hierarchy bidirectional attention to model long-range temporal consistency, together with a multi-stage frame-rate escalation strategy that enhances inter-frame coherence while preserving parallel synthesis efficiency. Evaluated on multiple long-video generation benchmarks, the approach achieves state-of-the-art performance, significantly improving both visual quality (sharpness, detail preservation, and structural integrity) and motion naturalness (optical flow smoothness and temporal plausibility).
📝 Abstract
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
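The attention pattern described above, bidirectional within each frame-rate level and autoregressive across levels, can be pictured as a block attention mask. The sketch below is an illustrative reconstruction under assumed token ordering (coarsest level first), not the paper's actual implementation; the function name `tempo_attention_mask` and the flat token layout are assumptions.

```python
import numpy as np

def tempo_attention_mask(level_sizes):
    """Build a boolean attention mask for a sequence laid out as
    [level 0 tokens | level 1 tokens | ...], coarsest frame rate first.

    Tokens within the same frame-rate level attend to each other
    bidirectionally; tokens in a finer level additionally attend to all
    coarser (already-generated) levels, giving block-causal
    autoregression across frame rates.
    """
    n = sum(level_sizes)
    mask = np.zeros((n, n), dtype=bool)  # True = attention allowed
    start = 0
    for size in level_sizes:
        end = start + size
        # Rows of this level may look at everything up to and including
        # this level: bidirectional within, causal across levels.
        mask[start:end, :end] = True
        start = end
    return mask

# Example: 2 coarse-level tokens followed by 3 finer-level tokens.
mask = tempo_attention_mask([2, 3])
```

In this layout, coarse tokens never attend to finer ones, which is what allows each finer level to be synthesized in parallel once the coarser blueprint is fixed.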