🤖 AI Summary
To address the low 3D shape reconstruction accuracy and poor robustness of purely vision-based, object-level SLAM under sparse and noisy input data in autonomous driving, this paper proposes the first object-level SLAM framework integrating normalized flows into implicit neural fields. It represents vehicle signed distance functions (SDFs) using 16-dimensional latent codes and incorporates semantic-guided keypoint sampling, joint incremental shape optimization, and a unified sparsity-aware loss enforcing consistency among sparse SDF, rendering, and mask contour predictions. A stereo-vision frontend enables high-fidelity reconstruction solely from sparse 3D points generated via bundle adjustment. Experiments on both synthetic and real-world datasets demonstrate reconstruction accuracy comparable to depth-sensor-based methods. Crucially, the framework maintains stable SLAM performance even under extremely sparse inputs, significantly enhancing robustness to noise and generalization capability across diverse scenarios.
📝 Abstract
We propose a novel, vision-only object-level SLAM framework for automotive applications representing 3D shapes by implicit signed distance functions. Our key innovation consists of augmenting the standard neural representation by a normalizing flow network. As a result, achieving strong representation power on the specific class of road vehicles is made possible by compact networks with only 16-dimensional latent codes. Furthermore, the newly proposed architecture exhibits a significant performance improvement in the presence of only sparse and noisy data, which is demonstrated through comparative experiments on synthetic data. The module is embedded into the back-end of a stereo-vision based framework for joint, incremental shape optimization. The loss function is given by a combination of a sparse 3D point-based SDF loss, a sparse rendering loss, and a semantic mask-based silhouette-consistency term. We furthermore leverage semantic information to determine keypoint extraction density in the front-end. Finally, experimental results on real-world data reveal accurate and reliable performance comparable to alternative frameworks that make use of direct depth readings. The proposed method performs well with only sparse 3D points obtained from bundle adjustment, and eventually continues to deliver stable results even under exclusive use of the mask-consistency term.