HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

Published: December 5, 2025 | arXiv ID: 2512.05693v1

By: Zhiying Du, Bei Liu, Yaobo Liang and more

The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. Because existing methods lack explicit designs for handling such heterogeneity, they struggle to integrate these diverse factors, which limits their generalization and degrades performance when they are transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to handle heterogeneous robotic data effectively. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module, which adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. In extensive experiments on simulation benchmarks and real-world robotic platforms, HiMoE-VLA consistently outperforms existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.
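For readers unfamiliar with mixture-of-experts routing, the following is a minimal, generic sketch of a stack of top-k MoE layers in pure Python. It is not the HiMoE architecture from the paper: the class `MoELayer`, the helpers `build_stack` and `run_stack`, the linear experts, the residual update, and the choice of fewer experts in later layers are all assumptions made here for illustration only. The authors' actual design is in the linked repository.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class MoELayer:
    """Generic top-k mixture-of-experts layer (hypothetical, not the paper's design)."""

    def __init__(self, dim, num_experts, top_k, rng):
        self.top_k = top_k
        # Router: one linear score per expert, computed from the input vector.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a single dim x dim linear map, kept small for brevity.
        self.experts = [
            [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        # Renormalize so the selected experts' weights sum to 1.
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, x):
        """Residual update: x plus the weighted sum of selected expert outputs."""
        out = list(x)
        for idx, weight in self.route(x):
            expert_out = matvec(self.experts[idx], x)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


def build_stack(dim, experts_per_layer, top_k, seed=0):
    """Build one MoE layer per entry in experts_per_layer (assumed layout)."""
    rng = random.Random(seed)
    return [MoELayer(dim, n, top_k, rng) for n in experts_per_layer]


def run_stack(layers, x):
    """Pass a feature vector through the layers in order."""
    for layer in layers:
        x = layer.forward(x)
    return x


if __name__ == "__main__":
    layers = build_stack(dim=4, experts_per_layer=[4, 2], top_k=2)
    print(run_stack(layers, [0.5, -1.0, 0.25, 2.0]))
```

In this sketch the router picks the two highest-scoring experts per input and mixes their outputs, so different inputs (for example, data from different robots) can activate different experts while sharing the rest of the network.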

Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Robotics

Robots learn to do more tasks faster.

16 Oct 2025 1

92%

MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation

Robotics

Robots learn new tasks with one example.

18 Oct 2025 1

91%

FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

Robotics

Trains robots privately without sharing data.

4 Aug 2025 0
