🤖 AI Summary
This work addresses a limitation of large-scale multimodal autonomous driving datasets such as ZOD, which provide bounding-box labels but no pixel-level semantic segmentation annotations, and which exhibit severe class imbalance: pedestrians, cyclists, and signs together account for less than 1% of pixels. The authors propose an automatic annotation pipeline based on the Segment Anything Model (SAM) that converts ZOD's bounding boxes into dense pixel-wise masks. From over 100,000 processed frames, a manually curated subset of 2,300 frames serves as a reliable baseline. On these annotations, transformer-based CLFT and CNN-based DeepLabV3+ models are evaluated across weather conditions, with CLFT-Hybrid reaching up to 48.1% mIoU, and specialized models targeting rare classes are explored to counter the imbalance. The pipeline is further validated on the Iseauto platform, achieving 77.5% mIoU, and SAM-derived representations are shown to transfer across sensor configurations via bidirectional transfer learning.
📝 Abstract
Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
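For readers less familiar with the reported metric, the following is a minimal sketch of how mean Intersection over Union (mIoU) and per-class pixel share can be computed from a predicted and a reference label map. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the three-class layout, and the toy label maps are all hypothetical, and details such as ignore labels or dataset-level accumulation are omitted.

```python
from collections import Counter

def per_class_iou(pred, target, num_classes):
    """Return one IoU per class; None marks a class absent from both maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                # Pixel counts toward both intersection and union of its class.
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    return [i / u if u else None for i, u in zip(inter, union)]

def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that occur in either map."""
    ious = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(ious) / len(ious)

def pixel_share(target, num_classes):
    """Fraction of reference pixels belonging to each class."""
    counts = Counter(label for row in target for label in row)
    total = sum(counts.values())
    return [counts.get(c, 0) / total for c in range(num_classes)]

# Hypothetical toy maps: 0 = background, 1 = vehicle, 2 = pedestrian.
target = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 2]]
pred = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0]]

ious = per_class_iou(pred, target, num_classes=3)
miou = mean_iou(pred, target, num_classes=3)
share = pixel_share(target, num_classes=3)
```

In this toy case the single missed pedestrian pixel yields an IoU of 0 for that class and lowers the mean by a third, even though the pixel is a small fraction of the image. This is why rare classes can dominate mIoU on imbalanced data.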