🤖 AI Summary
Existing approaches to unknown object detection and class-agnostic segmentation in autonomous driving rely heavily on labeled known categories and incur substantial computational overhead. To address these limitations, this paper proposes a motion-only, multi-scale video Transformer framework. Our method introduces (1) a memory-centric, multi-stage query-memory decoding architecture with scale-specific stochastic token dropping, preserving high-resolution spatiotemporal features while significantly improving efficiency; and (2) end-to-end training with a shared, learnable memory module—eliminating the need for optical flow or vision foundation models. Evaluated on DAVIS’16, KITTI, and Cityscapes, our approach consistently outperforms multi-scale baselines in accuracy, while reducing GPU memory consumption and accelerating inference. These results demonstrate strong practicality and generalization capability for safety-critical robotic systems.
📝 Abstract
Ensuring safety in autonomous driving is a complex challenge that requires handling unknown objects and unforeseen driving scenarios. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories, and recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We develop multiscale video transformers that detect unknown objects using only motion cues: an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and scale-specific random token dropping to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. Evaluated on DAVIS'16, KITTI, and Cityscapes, our method consistently outperforms multiscale baselines while remaining efficient in GPU memory and runtime, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
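The scale-specific random token dropping mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, drop rates, and token counts are assumptions, and the key idea shown is simply that finer (higher-resolution) scales, which contribute the most tokens, are assigned higher drop rates to cut attention cost:

```python
import torch

def drop_tokens(tokens: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Randomly keep a subset of tokens along the sequence dimension.

    tokens: (batch, num_tokens, channels); drop_rate in [0, 1).
    Hypothetical sketch -- not the paper's exact sampling scheme.
    """
    n = tokens.shape[1]
    keep = max(1, int(round(n * (1.0 - drop_rate))))
    idx = torch.randperm(n)[:keep]  # sample tokens to keep, uniformly at random
    return tokens[:, idx, :]

if __name__ == "__main__":
    # Illustrative multiscale setup: finer scales have more tokens,
    # so they receive higher drop rates (values are assumptions).
    scales = {"1/8": (4096, 0.5), "1/16": (1024, 0.3), "1/32": (256, 0.0)}
    for name, (n_tokens, rate) in scales.items():
        x = torch.randn(2, n_tokens, 64)  # (batch, tokens, channels)
        y = drop_tokens(x, rate)
        print(f"scale {name}: {x.shape[1]} -> {y.shape[1]} tokens")
```

During training, dropping tokens per scale reduces the quadratic attention cost at the finest resolutions, while the shared memory module retains spatiotemporal context across the surviving tokens.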