AI Summary
NeRF training on real-world outdoor scenes suffers from instability due to inaccurate depth estimation, while existing depth regularization methods rely on costly 3D supervision and generalize poorly. To address this, we propose a view-consistent implicit regularization framework: instead of enforcing fixed depth values, we construct a probabilistic view-consistency distribution over ray sampling points via multi-view 2D pixel projections, and introduce, for the first time, a depth-pushing loss to suppress spurious geometric structures, eliminating the dependence on precise depth labels. Our method jointly learns distribution modeling and geometric optimization by fusing high-level semantic features from foundation models with low-level color features. Evaluated on multiple public benchmarks, our approach significantly outperforms state-of-the-art NeRF variants and depth-regularized methods, achieving substantial improvements in novel-view synthesis quality and reconstruction robustness.
Abstract
Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularization has proven to be among the most effective techniques. However, depth estimation models not only require expensive 3D supervision during training, but also suffer from generalization issues. As a result, depth estimates can be erroneous in practice, especially for unbounded outdoor scenes. In this paper, we propose to regularize NeRF training with view-consistent distributions instead of fixed depth-value estimates. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the 2D pixel locations obtained by projecting per-ray sampled 3D points into neighboring views. By sampling from these view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also introduce a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularization that eliminates common failure modes. Extensive experiments on various scenes from public datasets demonstrate that our proposed method generates significantly better novel view synthesis results than state-of-the-art NeRF variants as well as competing depth regularization methods.
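To make the two core ingredients concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: (1) per-ray 3D sample points are projected into a neighboring view with a pinhole model, and a softmax over feature similarity at the projected pixels yields a view-consistency distribution over the samples; (2) a depth-pushing loss penalizes rendering weight placed in front of a margin depth, suppressing spurious near-camera geometry. All function names (`project`, `view_consistency_distribution`, `depth_pushing_loss`), the L2 feature distance, the temperature `tau`, and the hinge form of the loss are illustrative assumptions.

```python
import numpy as np

def project(points, K, R, t):
    """Project Nx3 world points into a camera with intrinsics K,
    rotation R, and translation t (standard pinhole model)."""
    cam = points @ R.T + t            # world -> camera coordinates
    uv = cam @ K.T                    # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide -> Nx2 pixels

def view_consistency_distribution(feats_ref, feats_at_proj, tau=0.1):
    """Softmax over per-sample feature similarity: samples whose projected
    pixels resemble the reference pixel receive higher probability.
    feats_ref: (F,) feature of the reference pixel; feats_at_proj: (N, F)
    features sampled at the N projected locations. tau is a temperature."""
    sim = -np.sum((feats_at_proj - feats_ref) ** 2, axis=-1)  # negative L2
    z = (sim - sim.max()) / tau                               # stable softmax
    e = np.exp(z)
    return e / e.sum()

def depth_pushing_loss(weights, depths, margin):
    """Hinge-style penalty on rendering weight in front of `margin`,
    discouraging spurious 'floater' geometry close to the camera."""
    return float(np.sum(weights * np.maximum(margin - depths, 0.0)))
```

In a training loop, the consistency distribution would serve as a soft target for the per-ray rendering weights (an implicit regularizer, since no fixed depth value is ever enforced), while the depth-pushing loss is added as a separate term.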