Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis

📅 2026-05-31

🤖 AI Summary
This work addresses the marked drop in rendering quality that existing street-view novel-view synthesis (NVS) methods show when the target camera trajectory deviates from the original driving path, a drop attributed largely to ineffective use of multi-sensor data. To overcome this limitation, we propose StreetNVS, a video diffusion-based NVS framework that fuses three complementary signals (sparse LiDAR reprojections, surround-view images, and camera poses) to render street scenes from novel viewpoints with high fidelity. Our key innovations are a Reference-Enhanced Camera Attention module built on a relative ray-level positional encoding and a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR, improving generalization. Experiments on the Waymo Open Dataset show that StreetNVS performs on par with state-of-the-art methods that rely on 10–100× denser point clouds, while also synthesizing coherent video for large out-of-trajectory maneuvers such as elevation changes, lane shifts, pull-backs, and rotations.
📝 Abstract
Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We also develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on point clouds 10 to 100 times denser. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths involving elevation changes, lane shifts, pullbacks, and rotations. Project website: https://streetnvs.github.io
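To make the phrase "sparse LiDAR reprojections" concrete, the following is a minimal sketch of the standard pinhole reprojection of world-frame points into a target camera, producing a sparse depth map. It is not taken from the StreetNVS code: the function name `reproject_lidar`, the 3x4 `world_to_cam` matrix layout, the intrinsics parameters, and the nearest-point rule for resolving collisions are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[int, int]


def reproject_lidar(
    points_world: Sequence[Point],
    world_to_cam: Sequence[Sequence[float]],  # assumed 3x4 [R | t], row-major
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> Dict[Pixel, float]:
    """Project world-frame points into a pinhole camera (hypothetical helper).

    Returns a sparse depth map {(u, v): depth}. Pixels that receive no
    LiDAR point are simply absent, which is what makes the map sparse.
    """
    depth: Dict[Pixel, float] = {}
    for x, y, z in points_world:
        # Rigid transform into the camera frame (x right, y down, z forward).
        xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in world_to_cam)
        if zc <= 0.0:
            continue  # point lies behind the camera
        # Perspective division followed by the intrinsics.
        u = math.floor(fx * xc / zc + cx)
        v = math.floor(fy * yc / zc + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # point falls outside the image
        # Simple z-buffer: keep the nearest point that lands on each pixel.
        if (u, v) not in depth or zc < depth[(u, v)]:
            depth[(u, v)] = zc
    return depth
```

Calling the function with the pose of a novel viewpoint rather than a recorded one gives a depth map for that target view. The map is metrically accurate where points land and empty everywhere else, which matches the "accurate but incomplete" geometry the abstract describes.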
Problem

Research questions and friction points this paper is trying to address.

novel-view synthesis
multi-sensor fusion
street-view rendering
LiDAR conditioning
video diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-sensor fusion
video diffusion model
novel-view synthesis
sparse LiDAR conditioning
reference-enhanced attention