DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

📅 2025-05-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
End-to-end autonomous driving faces dual challenges: modeling multi-view perception and ensuring robust decision-making in rare scenarios (e.g., sharp turns), where existing approaches often suffer from mode averaging. To address this, we propose a visionโ€“action dual Mixture-of-Experts (MoE) architecture. The vision MoE dynamically routes key camera inputs based on scene-aware perception, while the action MoE employs driving-cognition-inspired behavioral specialization to activate task-specific expert modules, thereby avoiding decision smoothing. Our method jointly trains Vision-Language-Action (VLA) modeling, multi-view fusion, and dynamic routing, enabling fine-grained decoupling and coordination between perception and action. Evaluated in the closed-loop Bench2Drive benchmark, our approach achieves state-of-the-art performance, demonstrating significantly improved generalization to complex and rare scenarios as well as enhanced control precision.

๐Ÿ“ Abstract
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$pi_0$. Specifically, we add Vision MoE to Drive-$pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$pi_0$.
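The abstract describes the Vision MoE as a router that scores the available cameras against the current driving context and keeps only the most relevant views. The paper does not give implementation details, so the following is a minimal NumPy sketch under assumed shapes: a learned weight matrix scores each camera embedding against a scene context vector, and the top-k views are kept with renormalized routing weights (all names here are illustrative, not from the paper).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_cameras(scene_embedding, router_weights, k=3):
    """Score each camera view against the scene context and keep the top-k.

    scene_embedding: (d,) vector summarizing the current driving scene.
    router_weights:  (n_cameras, d) learned router parameters (hypothetical).
    Returns the indices of the selected cameras and their routing weights,
    renormalized so the selected views' weights sum to 1.
    """
    logits = router_weights @ scene_embedding       # (n_cameras,) relevance scores
    top_k = np.argsort(logits)[-k:][::-1]           # highest-scoring cameras first
    weights = softmax(logits[top_k])                # renormalize over the selection
    return top_k, weights

# Toy usage: 6 surround-view cameras, 8-dim scene context
rng = np.random.default_rng(0)
idx, w = route_cameras(rng.normal(size=8), rng.normal(size=(6, 8)), k=3)
```

Sparse top-k routing of this kind is what lets the model attend to crucial views (e.g. the front-left camera during a left turn) instead of fusing all cameras uniformly.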
Problem

Research questions and friction points this paper is trying to address.

Handling diverse and complex driving scenarios effectively
Processing multi-view sensory data robustly
Avoiding mode averaging in existing autonomous driving models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene-Specialized Vision MoE for dynamic camera selection
Skill-Specialized Action MoE for diverse driving behaviors
Combining vision and action MoE achieves SOTA performance