🤖 AI Summary
This paper addresses unsupervised, label-free generic object segmentation and free-space detection from monocular images. Methodologically, it introduces a self-supervised Stixel-World representation learning framework driven by LiDAR distillation: (i) it introduces LiDAR-guided Stixel ground-truth generation and knowledge distillation; (ii) it designs a multi-layer 2D Stixel-World direct-prediction architecture enabling instance-level localization of overlapping objects; and (iii) it employs a lightweight CNN with implicit monocular depth modeling for LiDAR-free inference. Contributions include: (1) a mid-level Stixel semantic representation that jointly models free space and instance segmentation; (2) rapid scene adaptation using only a small set of unlabeled images; and (3) competitive accuracy on benchmarks such as KITTI, with significantly reduced model parameters and real-time inference speed.
📝 Abstract
In this work, we present a novel approach for general object segmentation from a monocular image, eliminating the need for manually labeled training data and enabling rapid, straightforward training and adaptation with minimal data. Our model initially learns from LiDAR during the training process; the LiDAR is subsequently removed from the system, allowing it to operate solely on monocular imagery. This study leverages the concept of the Stixel-World to recognize a medium-level representation of its surroundings. Our network directly predicts a 2D multi-layer Stixel-World and is capable of recognizing and locating multiple, superimposed objects within an image. Due to the scarcity of comparable works, we divide the system's capabilities into modules and present a free-space detection evaluation in our experiments section. Furthermore, we introduce an improved method for generating Stixels from LiDAR data, which we use as ground truth for our network.
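To make the multi-layer Stixel-World idea concrete, the sketch below models each image column as holding a stack of Stixels (vertical segments spanning a range of pixel rows), so several overlapping objects can coexist in one column. This is a minimal illustration only; the class and field names (`Stixel`, `StixelWorld`, `top_row`, `confidence`) are our assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Stixel:
    # A vertical stick-like segment in one image column
    # (field names are illustrative, not from the paper).
    column: int
    top_row: int
    bottom_row: int
    confidence: float = 1.0

@dataclass
class StixelWorld:
    # Multi-layer Stixel-World: each column may hold several Stixels,
    # so overlapping objects at different depths can be represented.
    width: int
    columns: list = field(default_factory=list)

    def __post_init__(self):
        self.columns = [[] for _ in range(self.width)]

    def add(self, stixel: Stixel) -> None:
        self.columns[stixel.column].append(stixel)

    def layers_at(self, col: int) -> int:
        # Number of stacked (possibly overlapping) Stixels in a column.
        return len(self.columns[col])

# Toy example: two overlapping objects share column 10.
world = StixelWorld(width=240)
world.add(Stixel(column=10, top_row=120, bottom_row=200, confidence=0.9))
world.add(Stixel(column=10, top_row=80, bottom_row=140, confidence=0.7))
print(world.layers_at(10))  # 2
```

In a single-layer Stixel-World, each column holds at most one object Stixel; the multi-layer formulation above is what lets the network localize superimposed objects.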