Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Long-horizon autoregressive video generation often suffers from structural distortions because geometric consistency degrades over extended sequences. To address this, this work proposes COVRAG (Coverage-Maximizing Retrieval-Augmented Generation), a depth-based memory retrieval framework. Using pretrained 3D priors, COVRAG constructs a target-view coverage map as lightweight memory evidence and iteratively retrieves the historical frames with the largest residual coverage gain, that is, the frames that best explain target-view regions not yet covered by the current context or by previously selected memories. A sliding-window depth caching mechanism makes geometry estimation efficient, which improves scalability for long videos. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
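
The iterative retrieval described above can be read as a greedy set-cover loop. The following is a minimal sketch under that reading, not the authors' implementation: the function name `select_memory_frames`, the `budget` parameter, and the representation of coverage as sets of target-view pixel indices are all assumptions made for illustration.

```python
def select_memory_frames(context_cover, candidate_covers, budget):
    """Greedily pick history frames by residual coverage gain (illustrative sketch).

    context_cover:    set of target-view pixel indices already explained
                      by the current context frames.
    candidate_covers: dict mapping a frame id to the set of target-view
                      pixel indices that frame covers.
    budget:           maximum number of memory frames to retrieve (assumed).
    """
    covered = set(context_cover)
    remaining = dict(candidate_covers)
    selected = []

    while remaining and len(selected) < budget:
        # Residual gain: pixels a frame explains that nothing selected so far explains.
        best_id = max(remaining, key=lambda fid: len(remaining[fid] - covered))
        gain = len(remaining[best_id] - covered)
        if gain == 0:
            break  # No candidate adds new coverage, so stop early.
        selected.append(best_id)
        covered |= remaining.pop(best_id)

    return selected, covered


if __name__ == "__main__":
    context = {0, 1}
    candidates = {"a": {0, 1, 2}, "b": {2, 3, 4, 5}, "c": {5, 6}}
    print(select_memory_frames(context, candidates, budget=2))
```

In this toy input, frame `a` overlaps almost entirely with the context, so it is passed over in favor of frames that fill missing regions. That is the behavior the residual term is meant to encourage.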

📝 Abstract

Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
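
One plausible way to obtain a target-view coverage map from depth is to unproject each pixel of a history frame with its estimated depth, move it into the target camera's coordinate frame, and reproject it. The sketch below assumes a shared pinhole camera model and a known relative pose. It ignores occlusion (no z-buffer) and is not taken from the paper. The name `coverage_from_depth` and all of its parameters are hypothetical.

```python
def coverage_from_depth(depth, K, R, t, width, height):
    """Return the set of target-view pixels (u, v) reached by a source frame.

    depth:  2D list where depth[v][u] > 0 marks a valid source pixel.
    K:      (fx, fy, cx, cy) pinhole intrinsics, assumed shared by both views.
    R, t:   3x3 rotation (nested lists) and translation taking
            source-camera coordinates to target-camera coordinates.
    width, height: target image size in pixels.
    """
    fx, fy, cx, cy = K
    covered = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # Skip pixels without a valid depth estimate.
            # Unproject the pixel into 3D source-camera coordinates.
            p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Apply the rigid transform into the target camera frame.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # Point lies behind the target camera.
            # Project back to the target image plane and snap to a pixel.
            ut = round(fx * q[0] / q[2] + cx)
            vt = round(fy * q[1] / q[2] + cy)
            if 0 <= ut < width and 0 <= vt < height:
                covered.add((ut, vt))
    return covered


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    depth = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(sorted(coverage_from_depth(depth, (1, 1, 0, 0), identity, (1, 0, 0), 3, 2)))
```

Target pixels that no history frame reaches are the "missing" regions. Per-frame sets like this one are the kind of evidence a coverage-based retrieval step could compare.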

Problem

Research questions and friction points this paper is trying to address.

long video generation

geometric consistency

memory retrieval

3D reconstruction

autoregressive generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

coverage-maximizing retrieval

3D priors

memory-augmented generation