SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts
By: Jiaqi Liu, Ronghao Fu, Lang Sun, and more
Potential Business Impact:
Helps computers interpret satellite images of Earth more accurately.
The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
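The abstract describes an adaptive router that conditions expert selection on both the task type and the interpretation granularity. Since the paper's exact formulation is not given here, the following is a minimal, hypothetical sketch of such a router in PyTorch-style Python; the class name, the task/granularity embeddings, and the top-k gating choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRouter(nn.Module):
    """Sketch: route each token to a few experts, with the gate conditioned
    on task and granularity embeddings as well as the token features."""

    def __init__(self, d_model: int, n_experts: int, n_tasks: int,
                 n_granularities: int, top_k: int = 2):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, d_model)        # assumed task conditioning
        self.gran_emb = nn.Embedding(n_granularities, d_model) # assumed granularity conditioning
        self.gate = nn.Linear(3 * d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, task_id: torch.Tensor, gran_id: torch.Tensor):
        # x: (batch, seq, d_model); task_id, gran_id: (batch,)
        b, s, d = x.shape
        cond = torch.cat([self.task_emb(task_id), self.gran_emb(gran_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(b, s, 2 * d)
        logits = self.gate(torch.cat([x, cond], dim=-1))   # (batch, seq, n_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize over selected experts
        # Downstream, expert outputs would be combined with these weights.
        return weights, expert_idx
```

A full MoE layer would dispatch tokens to the selected experts and sum their outputs with the returned weights; the contrastive local/global augmentation mentioned in the abstract would then act on the resulting features, but its details are not specified in this summary.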
Similar Papers
RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation
CV and Pattern Recognition
Lets computers see Earth better from space.
MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding
CV and Pattern Recognition
Helps doctors understand medical images better.
A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
CV and Pattern Recognition
Drones find places using words and pictures.