🤖 AI Summary
This work addresses single-image novel view synthesis (NVS) with a lightweight approach that uses rasterized 3D scene coordinate maps (“pointmaps”) as geometric priors to condition a pre-trained 2D diffusion model. The method combines reference attention blocks with a ControlNet that embeds pointmap features, balancing generative capability against multi-view geometric consistency. Compared with existing single-image NVS baselines, it uses significantly fewer trainable parameters while producing high-fidelity, cross-view-consistent novel views on multiple real-world datasets. It requires neither explicit 3D reconstruction nor fine-tuning of the diffusion backbone, which simplifies deployment and pedagogical demonstration.
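To make the pointmap idea concrete, the following is a minimal NumPy sketch of one way such a map could be built: unproject a reference depth map into world-space 3D coordinates, then rasterize those points into a target camera with a nearest-point test. The function names (`depth_to_pointmap`, `rasterize_pointmap`), the pinhole camera model, and the assumption that a depth map and camera poses are available are all illustrative and are not taken from the paper.

```python
import numpy as np


def depth_to_pointmap(depth, K, cam_to_world):
    """Unproject an (H, W) depth map into an (H, W, 3) map of world coordinates.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project through the inverse intrinsics, then scale by depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth[..., None]
    # Rigid transform from camera to world coordinates.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t


def rasterize_pointmap(points_world, K, world_to_cam, height, width):
    """Splat world-space points into a target view, keeping the nearest per pixel.

    Returns an (H, W, 3) map of world coordinates and an (H, W) validity mask.
    """
    pts = points_world.reshape(-1, 3)
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    cam = pts @ R.T + t
    front = cam[:, 2] > 1e-6  # discard points behind the target camera
    pts, cam = pts[front], cam[front]
    proj = cam @ K.T
    uv = np.rint(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    uv, pts, z = uv[inside], pts[inside], cam[inside, 2]
    # Sort by pixel index, then by depth, so the first entry per pixel is nearest.
    flat = uv[:, 1] * width + uv[:, 0]
    order = np.lexsort((z, flat))
    pixel_ids, first = np.unique(flat[order], return_index=True)
    out = np.zeros((height * width, 3))
    mask = np.zeros(height * width, dtype=bool)
    out[pixel_ids] = pts[order][first]
    mask[pixel_ids] = True
    return out.reshape(height, width, 3), mask.reshape(height, width)
```

In this sketch, pixels that no point lands on stay zero and are flagged in the mask, which is one simple way to mark regions the conditioned model would have to fill in on its own.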
📝 Abstract
In this paper, we present PointmapDiffusion, a novel framework for single-image novel view synthesis (NVS) that builds on pre-trained 2D diffusion models. Our method is the first to leverage pointmaps (i.e., rasterized 3D scene coordinates) as a conditioning signal, capturing geometric priors from the reference images to guide the diffusion process. By embedding reference attention blocks and a ControlNet for pointmap features, our model balances generative capability and geometric consistency, enabling accurate view synthesis across varying viewpoints. Extensive experiments on diverse real-world datasets demonstrate that PointmapDiffusion achieves high-quality, multi-view consistent results with significantly fewer trainable parameters than other baselines for single-image NVS tasks.
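The abstract does not spell out how the reference attention blocks are built. As a rough illustration only, one common formulation lets the target-view tokens attend over the concatenation of their own tokens and the reference-image tokens. The single-head NumPy sketch below assumes that formulation; the function name `reference_attention` and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical and may differ from the paper's design.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_attention(target_tokens, reference_tokens, w_q, w_k, w_v):
    """Single-head attention where target tokens also attend to reference tokens.

    target_tokens:    (N_t, d) features of the view being generated
    reference_tokens: (N_r, d) features of the reference image
    w_q, w_k, w_v:    (d, d_head) projection matrices (hypothetical names)
    """
    # Keys and values come from both the target and the reference tokens.
    context = np.concatenate([target_tokens, reference_tokens], axis=0)
    q = target_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    # Scaled dot-product attention over the joint context.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

The output has one row per target token, so a block like this could stand in for an ordinary self-attention layer without changing tensor shapes downstream.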