🤖 AI Summary
Existing 3D/4D generation and reconstruction methods perform physics-based alignment independently at each stage, so geometric misalignment accumulates across stages. This work proposes a zero-shot geometric guidance framework that explicitly integrates pose-free scene reconstruction into the generative process, enabling joint optimization of generation and reconstruction. The key contribution is the first camera-pose-agnostic dual-geometry reward function, embedded within diffusion or autoregressive generative frameworks to provide gradient-based geometric supervision, enabling end-to-end, unsupervised, zero-shot geometric alignment. Across multiple benchmarks, the method yields significant improvements in depth, surface-normal, and motion consistency: 3D structural error is reduced by 27%, and 4D temporal geometric coherence is substantially enhanced.
📝 Abstract
Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve alignment separately at each stage, making it difficult to correct subtle misalignments introduced in the other stage. Here, we present SteerX, a zero-shot inference-time steering method that incorporates scene reconstruction into the generation process, tilting the data distribution toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation that use pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.
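The abstract describes reward-guided inference-time steering without detailing the mechanism. Below is a minimal, hypothetical sketch of one common form of such steering, best-of-N candidate selection under a geometric reward. The generator, reference reconstruction, and reward function here are toy stand-ins, not the actual SteerX components:

```python
# Hypothetical sketch of inference-time steering via best-of-N selection.
# All components below are toy stand-ins, not the SteerX implementation.
import random

random.seed(0)

def generate_candidates(n, size=16):
    # Stand-in for a generative model: each candidate is a flat "depth map".
    return [[random.gauss(0.0, 1.0) for _ in range(size)] for _ in range(n)]

def geometric_reward(candidate, reference):
    # Toy geometric reward: negative mean absolute depth error against a
    # reference reconstruction (a pose-free feed-forward model would
    # supply this signal in practice).
    return -sum(abs(c - r) for c, r in zip(candidate, reference)) / len(candidate)

def steer(n_candidates, reference):
    # Sample several candidates and keep the highest-reward one, tilting
    # the output distribution toward better geometric alignment.
    candidates = generate_candidates(n_candidates)
    rewards = [geometric_reward(c, reference) for c in candidates]
    best = max(range(n_candidates), key=lambda i: rewards[i])
    return candidates[best], rewards[best]

reference = [0.0] * 16       # toy "ground-truth" reconstruction
best, reward = steer(8, reference)
```

In practice, steering methods often rescore partial samples at intermediate denoising steps rather than only at the end, but the principle is the same: the reward reweights which samples survive, without retraining the generator.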