🤖 AI Summary
To balance real-time inference and accuracy for end-to-end autonomous driving under resource constraints, this paper proposes a model-agnostic, multi-stage knowledge distillation framework. It transfers the full-stack capabilities (3D object detection, HD-map segmentation, motion/occupancy forecasting, and goal-directed planning) from a large multimodal teacher model (e.g., UniAD) to a lightweight, vision-only bird's-eye-view (BEV) student model. The approach integrates BEV feature representation, cross-modal knowledge transfer, and region-adaptive supervision. Evaluated on nuScenes, the distilled model achieves 39.0 mAP, 1.08 minADE, and a 0.32% collision rate while running at 11 FPS. With only 28M parameters (78% fewer than the baseline), it runs five times faster, enabling, for the first time, real-time, perception-planning-integrated deployment on a compact vision-only BEV architecture.
📝 Abstract
We present TinyBEV, a unified, camera-only Bird's-Eye-View (BEV) framework that distills the full-stack capabilities of a large planning-oriented teacher (UniAD [19]) into a compact, real-time student model. Unlike prior efficient camera-only baselines such as VAD [23] and VADv2 [7], TinyBEV supports the complete autonomy stack (3D detection, HD-map segmentation, motion forecasting, occupancy prediction, and goal-directed planning) within a streamlined 28M-parameter backbone, a 78% reduction in parameters over UniAD [19]. Our model-agnostic, multi-stage distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to transfer high-capacity multimodal knowledge to a lightweight BEV representation. On nuScenes [4], TinyBEV achieves 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32% collision rate, while running 5x faster (11 FPS) and requiring only camera input. These results demonstrate that full-stack driving intelligence can be retained in resource-constrained settings, bridging the gap between large-scale, multimodal perception-planning models and deployment-ready real-time autonomy.
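To make the distillation strategy concrete, the sketch below illustrates how feature-level, output-level, and region-adaptive supervision might combine into a single training loss. This is an illustrative reconstruction, not the paper's actual formulation: the function name, loss weights, temperature, and the use of MSE for BEV features plus temperature-softened KL for output logits are all assumptions about how such a multi-stage objective is typically built.

```python
import numpy as np

def distillation_loss(student_feat, teacher_feat,
                      student_logits, teacher_logits,
                      region_weights, alpha=1.0, beta=1.0, temperature=2.0):
    """Illustrative multi-stage distillation loss (hypothetical, not the paper's exact form).

    student_feat / teacher_feat: BEV feature maps, shape (H, W, C)
    student_logits / teacher_logits: per-cell class logits, shape (H, W, K)
    region_weights: shape (H, W); up-weights safety-critical BEV regions
    """
    # Feature-level term: region-weighted MSE between BEV feature maps.
    feat_err = ((student_feat - teacher_feat) ** 2).mean(axis=-1)
    feat_loss = (region_weights * feat_err).mean()

    # Output-level term: temperature-softened KL divergence on per-cell logits.
    def softmax(x, t):
        z = np.exp((x - x.max(axis=-1, keepdims=True)) / t)
        return z / z.sum(axis=-1, keepdims=True)

    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = (p_teacher * (np.log(p_teacher + 1e-8)
                       - np.log(p_student + 1e-8))).sum(axis=-1)
    output_loss = (region_weights * kl).mean()

    # Region-adaptive supervision enters through region_weights in both terms.
    return alpha * feat_loss + beta * output_loss

# Usage: when the student matches the teacher exactly, the loss vanishes.
H, W, C, K = 4, 4, 8, 3
rng = np.random.default_rng(0)
feat = rng.normal(size=(H, W, C))
logits = rng.normal(size=(H, W, K))
weights = np.ones((H, W))
zero_loss = distillation_loss(feat, feat, logits, logits, weights)
```

In a real pipeline the region weights would be derived from ground-truth occupancy or planned trajectories so that supervision concentrates on drivable and interactive areas rather than empty BEV cells.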