🤖 AI Summary
To address the challenges of cross-modal alignment, temporal inconsistency, and high annotation cost in camera-LiDAR 4D spatiotemporal segmentation for autonomous driving, this paper proposes the first promptable multimodal 4D segmentation framework. Our method introduces three key innovations: (1) a Unified Multimodal Positional Encoding (UMPE) that achieves geometrically consistent alignment of camera and LiDAR features in 3D space; (2) a Motion-Aware Cross-Modal Memory Attention (MCMA) mechanism that models long-range temporal dynamics and improves temporal robustness; and (3) a Vision Foundation Model (VFM)-driven, fully automated multimodal data engine that efficiently generates high-fidelity pseudo-labels. Evaluated on our newly constructed Waymo-4DSeg benchmark, the proposed framework significantly outperforms existing methods: pseudo-labeling runs three orders of magnitude faster than manual annotation while substantially improving point-wise semantic fidelity.
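To make the UMPE idea concrete, below is a minimal PyTorch sketch of a shared 3D positional encoding: camera pixels are back-projected into the ego frame with a pinhole model, LiDAR points already live in that frame, and both modalities then pass through a single shared encoder. This is an illustrative assumption, not the paper's code; the class and function names, the sinusoidal frequency schedule, and the back-projection details are all hypothetical.

```python
import torch
import torch.nn as nn


def sinusoidal_encode(xyz: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Encode (N, 3) coordinates with sin/cos at geometric frequencies.

    Returns (N, 3 * 2 * num_freqs) positional features.
    """
    freqs = 2.0 ** torch.arange(num_freqs, device=xyz.device)  # (F,)
    scaled = xyz.unsqueeze(-1) * freqs                         # (N, 3, F)
    return torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(1)


class SharedPositionalEncoding(nn.Module):
    """Toy stand-in for UMPE: both modalities are mapped into the same
    ego-vehicle 3D frame, then pass through one shared projection so
    camera and LiDAR tokens carry comparable positional features."""

    def __init__(self, num_freqs: int = 8, dim: int = 256):
        super().__init__()
        self.num_freqs = num_freqs
        self.proj = nn.Linear(3 * 2 * num_freqs, dim)  # shared across modalities

    def lift_pixels(self, uv: torch.Tensor, depth: torch.Tensor,
                    K: torch.Tensor, cam_to_ego: torch.Tensor) -> torch.Tensor:
        """Back-project (N, 2) pixels with (N,) depths into the ego frame
        via a pinhole model (assumed; the real geometry may differ)."""
        pix = torch.stack([uv[:, 0] * depth, uv[:, 1] * depth, depth], dim=-1)
        cam_pts = pix @ torch.linalg.inv(K).T               # camera frame, (N, 3)
        homo = torch.cat([cam_pts, torch.ones_like(depth).unsqueeze(-1)], dim=-1)
        return (homo @ cam_to_ego.T)[:, :3]                 # ego frame, (N, 3)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # LiDAR points are passed directly; camera tokens use lift_pixels first.
        return self.proj(sinusoidal_encode(xyz, self.num_freqs))
```

Because the projection weights are shared, a camera token and a LiDAR point at the same 3D location receive the same positional feature, which is the geometric-consistency property the summary attributes to UMPE.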
📝 Abstract
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation in dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This pipeline generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. Extensive experiments on the constructed Waymo-4DSeg benchmark demonstrate SAM4D's powerful cross-modal segmentation ability and its great potential for data annotation.
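The abstract pairs memory attention with ego-motion compensation; the sketch below shows one plausible reading of that combination, in which stored 3D positions from past frames are warped into the current ego frame before cross-attention over memory tokens. All names, shapes, the single attention layer, and the warp-then-re-encode scheme are assumptions for illustration, not SAM4D's actual implementation.

```python
import torch
import torch.nn as nn


class MotionCompensatedMemoryAttention(nn.Module):
    """Minimal sketch of the MCMA idea: memory tokens carry 3D positions,
    and before cross-attention those positions are transformed by the
    relative ego pose so static structure stays spatially aligned."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pos_mlp = nn.Linear(3, dim)  # re-encode warped positions

    def forward(self,
                query_feats: torch.Tensor,   # (B, Nq, C) current-frame tokens
                memory_feats: torch.Tensor,  # (B, Nm, C) past-frame tokens
                memory_xyz: torch.Tensor,    # (B, Nm, 3) positions, past ego frame
                rel_pose: torch.Tensor       # (B, 4, 4) past-ego -> current-ego
                ) -> torch.Tensor:
        # Ego-motion compensation: warp stored positions into the current
        # ego frame before they influence attention.
        homo = torch.cat([memory_xyz,
                          torch.ones_like(memory_xyz[..., :1])], dim=-1)  # (B, Nm, 4)
        warped = torch.einsum('bij,bnj->bni', rel_pose, homo)[..., :3]
        keys = memory_feats + self.pos_mlp(warped)
        out, _ = self.attn(query_feats, keys, keys)
        return query_feats + out  # residual update of current-frame tokens
```

Warping the memory's positional cues rather than its features keeps static background keys consistent with the current frame as the ego vehicle moves, which is one way the temporal robustness described above could be realized.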