🤖 AI Summary
To address the limited robustness of multimodal perception under adverse weather, this paper proposes the Camera-LiDAR Fusion Transformer (CLFT), which integrates the self-attention mechanism of vision transformers into a cross-modal point cloud–image fusion framework for semantic segmentation. CLFT comprises dual-stream Vision Transformer (ViT) encoders, a cross-modal feature alignment module, and a learnable token-level interaction mechanism between point clouds and images, together with a training strategy aimed at weather robustness. Evaluated on benchmarks including nuScenes, CLFT delivers strong semantic segmentation accuracy, notably pixel-level recognition of small objects such as pedestrians, cyclists, and traffic signs, although performance still degrades in rain and low-light conditions, leaving room for further optimization. This work establishes a high-precision approach to joint multimodal segmentation for autonomous driving systems.
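To make the dual-stream design concrete, below is a minimal PyTorch sketch of the general idea: two ViT-style encoders process the camera image and a LiDAR map projected to the image plane separately, and their per-patch tokens are fused for per-patch classification. This is an illustrative assumption, not the authors' CLFT implementation; all module names, dimensions, and the concatenation-based fusion scheme are placeholders.

```python
# Sketch of a dual-stream camera-LiDAR fusion transformer (illustrative only;
# module names, sizes, and the fusion scheme are assumptions, not CLFT's code).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an input map into non-overlapping patches and embed them as tokens."""
    def __init__(self, in_ch: int, dim: int, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) token sequence

class DualStreamFusion(nn.Module):
    """Two transformer encoders (camera, LiDAR projection) plus token-level fusion."""
    def __init__(self, dim: int = 256, depth: int = 2, heads: int = 8,
                 num_classes: int = 4, patch: int = 16):
        super().__init__()
        self.cam_embed = PatchEmbed(3, dim, patch)   # RGB image stream
        self.lid_embed = PatchEmbed(1, dim, patch)   # LiDAR range/height map stream
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cam_enc = nn.TransformerEncoder(layer, depth)  # layers are deep-copied,
        self.lid_enc = nn.TransformerEncoder(layer, depth)  # so the streams are independent
        # Fuse by concatenating per-patch tokens from both streams, then projecting.
        self.fuse = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(dim, num_classes)     # per-patch class logits

    def forward(self, img: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        c = self.cam_enc(self.cam_embed(img))     # (B, N, dim)
        l = self.lid_enc(self.lid_embed(lidar))   # (B, N, dim)
        fused = self.fuse(torch.cat([c, l], dim=-1))
        return self.head(fused)                   # (B, N, num_classes)

model = DualStreamFusion()
img = torch.randn(1, 3, 224, 224)      # camera frame
lidar = torch.randn(1, 1, 224, 224)    # LiDAR projected onto the image plane
print(model(img, lidar).shape)         # torch.Size([1, 196, 4])
```

In a full segmentation model, the per-patch logits would be upsampled by a decoder to produce pixel-level masks; this sketch stops at the token level to keep the fusion idea visible.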
📝 Abstract
This paper presents the Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which fuse camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self-attention mechanism, we extend segmentation to a broader set of object classes, including cyclists, traffic signs, and pedestrians, across diverse weather conditions. Despite good overall performance, the models face challenges under adverse conditions, underscoring the need for further optimization to improve performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state of the art in multimodal fusion and object segmentation, with ongoing work required to address existing limitations and fully realize their potential in practical deployments.