🤖 AI Summary
This work addresses the challenge of texture incompleteness and surface artifacts in novel view synthesis under sparse multi-camera configurations, where limited viewpoints often result in missing visual details. To this end, the authors propose a post-processing inpainting method tailored for real-time 3D streaming. It leverages a multi-view-aware Transformer architecture to perform texture completion independently after rendering, making it compatible with any calibrated multi-camera system. The approach incorporates spatio-temporal embeddings to ensure inter-frame consistency, and combines a resolution-agnostic design with an adaptive patch selection strategy to achieve high visual fidelity under real-time performance constraints. Experimental results demonstrate that, under identical real-time requirements, the proposed method outperforms existing techniques in both image and video quality metrics, establishing a new state-of-the-art trade-off between quality and speed.
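The summary does not give implementation details for the spatio-temporal embeddings. As a minimal sketch only, one common way to realize such an embedding is to sum sinusoidal encodings of each patch's spatial position and frame index, so the transformer can relate tokens across frames; every name and shape below is a hypothetical illustration, not the paper's actual design.

```python
import numpy as np

def sincos_embedding(positions, dim):
    """Standard sinusoidal encoding of a 1-D coordinate (hypothetical helper)."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]          # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))   # (dim/2,)
    angles = positions * freqs                                            # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)       # (N, dim)

def spatio_temporal_embedding(xs, ys, ts, dim):
    """Sum of per-axis encodings: patch x, patch y, and frame index t.
    The temporal term distinguishes otherwise identical patches from
    different frames, which is one way to encourage inter-frame consistency."""
    assert dim % 2 == 0
    return (sincos_embedding(xs, dim)
            + sincos_embedding(ys, dim)
            + sincos_embedding(ts, dim))

# Example: 4 patch tokens, two spatial positions in each of two consecutive frames
emb = spatio_temporal_embedding(xs=[0, 1, 0, 1], ys=[0, 0, 0, 0], ts=[0, 0, 1, 1], dim=8)
print(emb.shape)  # (4, 8)
```

Because the encoding depends only on coordinates, not on image resolution, the same scheme extends naturally to the resolution-agnostic setting the summary mentions.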
📄 Abstract
High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often due to real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures with a novel, application-targeted inpainting method that operates as an image-based post-processing step after novel view rendering and is independent of the underlying representation. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view-aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image-based and video-based metrics.
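The abstract does not specify how the adaptive patch selection works. As a hedged sketch of the general idea, one plausible realization is to run the expensive inpainting network only on the fixed-size patches that actually contain holes, ranked by hole size and capped by a per-frame budget so inference cost stays bounded; the function name, parameters, and ranking rule below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def select_patches(hole_mask, patch=32, budget=16):
    """Hypothetical adaptive patch selection: score each non-overlapping
    patch by its number of missing pixels and keep at most `budget`
    patches, largest holes first. `hole_mask` is a boolean (H, W) array,
    True where texture is missing after novel view rendering."""
    H, W = hole_mask.shape
    scored = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            n_missing = int(hole_mask[y:y + patch, x:x + patch].sum())
            if n_missing > 0:                 # skip fully complete patches
                scored.append((n_missing, y, x))
    scored.sort(reverse=True)                 # largest holes first
    return [(y, x) for _, y, x in scored[:budget]]

# Example: a 128x128 frame with one 40x40 hole spanning four patches
mask = np.zeros((128, 128), dtype=bool)
mask[20:60, 50:90] = True
print(select_patches(mask, patch=32, budget=4))
```

Lowering `budget` trades completion quality for speed, which matches the speed/quality balance the abstract describes; patches outside the selection would simply keep their rendered pixels.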