🤖 AI Summary
Existing visual geometry models struggle with long-sequence streaming video due to unbounded growth of the key-value (KV) cache, leading to GPU memory exhaustion and unstable 3D reconstruction over time. This work proposes a training-free streaming visual geometry Transformer framework that integrates a self-selective caching mechanism and a dynamic anchor protection strategy. These components jointly compress the KV cache and mitigate geometric drift while maintaining compatibility with FlashAttention. The method processes arbitrarily long video streams under constant memory and computational overhead. It achieves state-of-the-art reconstruction accuracy on both indoor and outdoor datasets, including ultra-long sequences, and represents the first approach capable of efficient and stable 3D geometric reconstruction from videos of unlimited duration.
📝 Abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
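The cache-bounding idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of per-token FFN residual norms as an importance score, and the anchor mask are assumptions based on the abstract's description of Self-Selective Caching and Dynamic Anchor Protection. The key property is that eviction is a plain index selection over the cache, so the surviving keys/values stay contiguous and can be fed to an unmodified attention kernel such as FlashAttention.

```python
import numpy as np

def compress_kv_cache(keys, values, residual_norms, anchor_mask, budget):
    """Hypothetical sketch of selective KV-cache eviction.

    keys, values:    (n_tokens, d) cached key/value tensors
    residual_norms:  (n_tokens,) per-token FFN residual magnitudes,
                     assumed here as the importance score
    anchor_mask:     (n_tokens,) bool; True = coordinate-critical token
                     that must never be evicted
    budget:          maximum number of tokens to keep

    Returns the compressed keys/values and the kept indices.
    """
    n = keys.shape[0]
    if n <= budget:
        return keys, values, np.arange(n)

    anchor_idx = np.flatnonzero(anchor_mask)      # anchors always survive
    free_slots = budget - len(anchor_idx)
    non_anchor = np.flatnonzero(~anchor_mask)

    # Rank non-anchor tokens by residual magnitude; keep the largest ones.
    order = non_anchor[np.argsort(residual_norms[non_anchor])[::-1]]
    kept = np.sort(np.concatenate([anchor_idx, order[:free_slots]]))

    return keys[kept], values[kept], kept
```

Because the per-frame cost of this selection is linear in the cache size and the cache never exceeds `budget`, per-frame memory and compute stay constant regardless of how many frames have streamed past.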