🤖 AI Summary
To address the challenges of cross-modal alignment, temporal inconsistency, and high annotation cost in camera-LiDAR 4D spatiotemporal segmentation for autonomous driving, this paper proposes the first promptable multimodal 4D segmentation framework. Our method introduces three key innovations: (1) a Unified Multimodal Positional Encoding (UMPE) that achieves geometrically consistent alignment of camera and LiDAR features in 3D space; (2) a Motion-Aware Cross-Modal Memory Attention (MCMA) mechanism that models long-range temporal dynamics and improves temporal robustness; and (3) a Vision Foundation Model (VFM)-driven, fully automated multimodal data engine that efficiently generates high-fidelity pseudo-labels. Evaluated on our newly constructed Waymo-4DSeg benchmark, the proposed framework significantly outperforms existing methods: pseudo-labeling runs three orders of magnitude faster than manual annotation while substantially improving point-wise semantic fidelity.
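To make the UMPE idea concrete, below is a minimal PyTorch sketch of a shared 3D positional encoding: camera pixels are back-projected into the ego frame with a pinhole model, LiDAR points already live in that frame, and both modalities then pass through a single shared encoder. This is an illustrative assumption, not the paper's code; the class and function names, the sinusoidal frequency schedule, and the back-projection details are all hypothetical.

```python
import torch
import torch.nn as nn


def sinusoidal_encode(xyz: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Encode (N, 3) coordinates with sin/cos at geometric frequencies.

    Returns (N, 3 * 2 * num_freqs) positional features.
    """
    freqs = 2.0 ** torch.arange(num_freqs, device=xyz.device)  # (F,)
    scaled = xyz.unsqueeze(-1) * freqs                         # (N, 3, F)
    return torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(1)


class SharedPositionalEncoding(nn.Module):
    """Toy stand-in for UMPE: both modalities are mapped into the same
    ego-vehicle 3D frame, then pass through one shared projection so
    camera and LiDAR tokens carry comparable positional features."""

    def __init__(self, num_freqs: int = 8, dim: int = 256):
        super().__init__()
        self.num_freqs = num_freqs
        self.proj = nn.Linear(3 * 2 * num_freqs, dim)  # shared across modalities

    def lift_pixels(self, uv: torch.Tensor, depth: torch.Tensor,
                    K: torch.Tensor, cam_to_ego: torch.Tensor) -> torch.Tensor:
        """Back-project (N, 2) pixels with (N,) depths into the ego frame
        via a pinhole model (assumed; the real geometry may differ)."""
        pix = torch.stack([uv[:, 0] * depth, uv[:, 1] * depth, depth], dim=-1)
        cam_pts = pix @ torch.linalg.inv(K).T               # camera frame, (N, 3)
        homo = torch.cat([cam_pts, torch.ones_like(depth).unsqueeze(-1)], dim=-1)
        return (homo @ cam_to_ego.T)[:, :3]                 # ego frame, (N, 3)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # LiDAR points are passed directly; camera tokens use lift_pixels first.
        return self.proj(sinusoidal_encode(xyz, self.num_freqs))
```

Because the projection weights are shared, a camera token and a LiDAR point at the same 3D location receive the same positional feature, which is the geometric-consistency property the summary attributes to UMPE.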
📝 Abstract
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation in dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This pipeline generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. Extensive experiments on the constructed Waymo-4DSeg benchmark demonstrate SAM4D's powerful cross-modal segmentation ability and its great potential for data annotation.
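The abstract pairs memory attention with ego-motion compensation; the sketch below shows one plausible reading of that combination, in which stored 3D positions from past frames are warped into the current ego frame before cross-attention over memory tokens. All names, shapes, the single attention layer, and the warp-then-re-encode scheme are assumptions for illustration, not SAM4D's actual implementation.

```python
import torch
import torch.nn as nn


class MotionCompensatedMemoryAttention(nn.Module):
    """Minimal sketch of the MCMA idea: memory tokens carry 3D positions,
    and before cross-attention those positions are transformed by the
    relative ego pose so static structure stays spatially aligned."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pos_mlp = nn.Linear(3, dim)  # re-encode warped positions

    def forward(self,
                query_feats: torch.Tensor,   # (B, Nq, C) current-frame tokens
                memory_feats: torch.Tensor,  # (B, Nm, C) past-frame tokens
                memory_xyz: torch.Tensor,    # (B, Nm, 3) positions, past ego frame
                rel_pose: torch.Tensor       # (B, 4, 4) past-ego -> current-ego
                ) -> torch.Tensor:
        # Ego-motion compensation: warp stored positions into the current
        # ego frame before they influence attention.
        homo = torch.cat([memory_xyz,
                          torch.ones_like(memory_xyz[..., :1])], dim=-1)  # (B, Nm, 4)
        warped = torch.einsum('bij,bnj->bni', rel_pose, homo)[..., :3]
        keys = memory_feats + self.pos_mlp(warped)
        out, _ = self.attn(query_feats, keys, keys)
        return query_feats + out  # residual update of current-frame tokens
```

Warping the memory's positional cues rather than its features keeps static background keys consistent with the current frame as the ego vehicle moves, which is one way the temporal robustness described above could be realized.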