SAM4D: Segment Anything in Camera and LiDAR Streams

📅 2025-06-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of cross-modal alignment, temporal inconsistency, and high annotation cost in camera-LiDAR 4D spatiotemporal segmentation for autonomous driving, this paper proposes the first promptable multimodal 4D segmentation framework. Our method introduces three key innovations: (1) a Unified Multimodal Positional Encoding (UMPE) that achieves geometrically consistent alignment of camera and LiDAR features in 3D space; (2) a Motion-Aware Cross-Modal Memory Attention (MCMA) mechanism to model long-range temporal dynamics and enhance temporal robustness; and (3) a Vision Foundation Model (VFM)-driven fully automated multimodal data engine for efficient generation of high-fidelity pseudo-labels. Evaluated on our newly constructed Waymo-4DSeg benchmark, the proposed framework significantly outperforms existing methods. The pseudo-labeling speed accelerates by three orders of magnitude over manual annotation, while point-wise semantic fidelity is substantially improved.

Technology Category

Application Category

📝 Abstract
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.
Problem

Research questions and friction points this paper is trying to address.

Aligns camera and LiDAR features in 3D space for segmentation
Enhances temporal consistency in dynamic autonomous driving scenes
Generates pseudo-labels faster than human annotation for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multi-modal Positional Encoding for 3D alignment
Motion-aware Cross-modal Memory Attention for consistency
Multi-modal automated data engine for fast labeling
🔎 Similar Papers
No similar papers found.
Jianyun Xu
Jianyun Xu
Alibaba DAMO Academy
3D PerceptionAutonomous Driving
S
Song Wang
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Z
Ziqian Ni
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
C
Chunyong Hu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
S
Sheng Yang
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Jianke Zhu
Jianke Zhu
Professor of Computer Science, Zhejiang University
Computer VisionRobotics
Q
Qiang Li
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group