RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Self-supervised novel view synthesis (NVS) has been hard to scale on real-world video, mainly because training on realistic footage is brittle and multi-network designs scale unpredictably. This work proposes RayDer, a unified feed-forward Transformer that handles camera estimation, scene reconstruction, and rendering in a single backbone. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content so that training stays stable on unconstrained video; the target task remains static-scene NVS, with dynamic content used only as additional supervision rather than reconstructed. RayDer shows clear power-law scaling across model sizes and dataset scales, outperforms training on static-scene data mixtures, and reaches zero-shot open-set performance competitive with state-of-the-art supervised methods.
📝 Abstract
Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
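For readers unfamiliar with the term, "power-law scaling" means that an error metric falls roughly as y = a * x^(-b) when x (data or compute) grows, which appears as a straight line on log-log axes. The sketch below is only an illustration of how such a trend can be fitted; the function name `fit_power_law` and the data points are hypothetical and synthetic, not values or code from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                      # slope of the log-log line
    intercept = mean_y - slope * mean_x    # log(a)
    return math.exp(intercept), -slope     # (a, b)

# Synthetic points generated from a known law (NOT results from the paper).
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the points will not lie exactly on a line, so the quality of the fit (for example, the residuals in log space) indicates how "clean" the scaling trend is.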
Problem

Research questions and friction points this paper is trying to address.

self-supervised
novel view synthesis
real-world video
scalability
static-scene
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised novel view synthesis
unified transformer architecture
dynamic content as supervision
scalable NVS
power-law scaling