DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Published: May 22, 2025 | arXiv ID: 2505.16278v1

By: Zhenjie Yang, Yilin Chai, Xiaosong Jia, and more

Potential Business Impact:

Teaches self-driving cars to handle tricky turns.

Business Areas:

Autonomous Vehicles, Transportation

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon Drive-$\pi_0$, our baseline adapted from the $\pi_0$ Vision-Language-Action (VLA) model, which originates in the embodied AI field. Specifically, we add a Vision MoE to Drive-$\pi_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add an Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from the mode averaging that affects existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.
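
The two routing ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the router design, so the softmax top-k routing below is an assumption borrowed from common MoE practice, and every name and number (`CAMERAS`, `SKILL_EXPERTS`, the logits, the steering values) is hypothetical. The sketch shows a router picking a subset of camera views, and contrasts a routed action with the averaged action that cancels opposing maneuvers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return (index, renormalized weight) pairs for the k highest-scoring entries."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical camera names and router scores for a left-turn scene (assumption).
CAMERAS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]
camera_logits = [2.0, 1.5, -0.5, -1.0, 0.3, -2.0]
selected_cameras = [CAMERAS[i] for i, _ in top_k_route(camera_logits, k=2)]

# Hypothetical skill experts; each returns a steering command in [-1, 1] (assumption).
SKILL_EXPERTS = {
    "turn_left": lambda state: -0.8,
    "go_straight": lambda state: 0.0,
    "turn_right": lambda state: 0.8,
}

def moe_action(skill_logits, state, k=1):
    """Combine only the top-k skill experts, weighted by the router."""
    names = list(SKILL_EXPERTS)
    routed = top_k_route(skill_logits, k)
    return sum(w * SKILL_EXPERTS[names[i]](state) for i, w in routed)

def averaged_action(state):
    """Uniform average over all experts: opposing maneuvers cancel out."""
    outputs = [expert(state) for expert in SKILL_EXPERTS.values()]
    return sum(outputs) / len(outputs)

if __name__ == "__main__":
    print("cameras:", selected_cameras)
    print("routed steering:", moe_action([3.0, 0.1, -1.0], state=None, k=1))
    print("averaged steering:", averaged_action(state=None))
```

With `k=1` the routed action commits to a single skill, whereas the uniform average of a left turn and a right turn collapses toward driving straight, which is the kind of mode averaging the abstract says explicit specialization avoids.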

Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Robotics

Robots learn to do more tasks faster.

16 Oct 2025 1

91%

HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

Robotics

Robots learn from many different robot videos.

5 Dec 2025 1

91%

MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines

Artificial Intelligence

Robots learn tasks faster, like humans do.

10 Jul 2025 1

Page Count

16 pages
