🤖 AI Summary
Real-time novel view synthesis (NVS) from single-view, narrow-baseline inputs remains hindered by computational inefficiency, model bloat, and reliance on explicit 3D reconstruction—especially on mobile devices.
Method: We propose a lightweight, end-to-end framework tailored for mobile deployment. It features a novel multi-stage training strategy and a compact multi-encoder/decoder architecture; introduces a learnable spatial transformation module to implicitly model 3D image warping; incorporates a parallel occlusion-aware inpainting mechanism for enhanced robustness; and integrates camera pose embeddings for conditional view synthesis.
Contribution/Results: Trained on a subset of Open Images, our method surpasses state-of-the-art approaches: it achieves a 10× speedup in inference latency, reduces memory footprint by 6%, and delivers >30 FPS real-time performance on the Samsung Tab 9+, marking the first demonstration of high-quality, efficient narrow-baseline NVS on commodity mobile hardware.
📝 Abstract
Single-view novel view synthesis (NVS) is a notorious problem due to its ill-posed nature, and often requires large, computationally expensive approaches to produce tangible results. In this paper, we propose CheapNVS: a fully end-to-end approach for narrow baseline single-view NVS based on a novel, efficient multiple encoder/decoder design trained in a multi-stage fashion. CheapNVS first approximates the laborious 3D image warping with lightweight learnable modules that are conditioned on the camera pose embeddings of the target view, and then performs inpainting on the occluded regions in parallel to achieve significant performance gains. Once trained on a subset of Open Images dataset, CheapNVS outperforms the state-of-the-art despite being 10 times faster and consuming 6% less memory. Furthermore, CheapNVS runs comfortably in real-time on mobile devices, reaching over 30 FPS on a Samsung Tab 9+.