MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
By: Yi Liu, Xiao Xu, Zeyu Xu, and more
Potential Business Impact:
Lets phones understand pictures and words better.
Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. However, the substantial computational and storage demands of VLMs pose significant challenges for their efficient deployment on mobile devices, which represent the most ubiquitous and accessible computing platforms today. In this work, we introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. MagicVL-2B leverages a lightweight visual encoder with fewer than 100M parameters and features a redesigned dynamic resolution scheme that adaptively generates image tokens without excessive modification of image dimensions. To further enhance the performance of this compact encoder within VLMs, we propose a multimodal curriculum learning strategy that incrementally increases task difficulty and data information density throughout training. This approach substantially improves the model's performance across a variety of sub-tasks. Extensive evaluations on standard VLM benchmarks demonstrate that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. These results establish MagicVL-2B as a practical and robust solution for real-world mobile vision-language applications, enabling advanced multimodal intelligence to run directly on smartphones.
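The abstract does not specify how the dynamic resolution scheme or the curriculum stages are implemented, but the two ideas can be sketched. Below is a minimal, hypothetical illustration: the patch size, per-image token budget, and stage names are assumptions, not details from the paper. The first function picks an aspect-preserving downscale only when an image would exceed a token budget (rather than forcing a fixed square resize); the second orders training stages from easy to hard, the core move of curriculum learning.

```python
import math

# Illustrative sketch only: MagicVL-2B's actual dynamic-resolution rules and
# curriculum stages are not given in the abstract; the patch size, token
# budget, and stage names below are assumptions for demonstration.

PATCH = 14          # assumed ViT patch size
MAX_TOKENS = 576    # assumed per-image token budget

def dynamic_resolution(width, height, patch=PATCH, max_tokens=MAX_TOKENS):
    """Return (token_count, (w, h)): keep the native size when it fits the
    budget; otherwise downscale uniformly, snapping dims to the patch grid."""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    if tokens <= max_tokens:
        return tokens, (width, height)        # no resize needed
    scale = (max_tokens / tokens) ** 0.5      # uniform downscale factor
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)
    return (w // patch) * (h // patch), (w, h)

# Hypothetical curriculum stages tagged with a relative difficulty score.
STAGES = [("dense_reasoning", 3), ("caption_alignment", 1), ("visual_qa", 2)]

def curriculum_order(stages):
    """Schedule easier tasks before harder ones, as in curriculum learning."""
    return [name for name, difficulty in sorted(stages, key=lambda s: s[1])]
```

A 224x224 image maps to 16x16 = 256 patch tokens and is left untouched, while a 1024x768 image is downscaled until its token count fits the budget; the curriculum helper simply yields the stages in ascending difficulty.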
Similar Papers
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
CV and Pattern Recognition
Makes smart AI work on your phone.
Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
CV and Pattern Recognition
Helps blind people understand videos better.
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
CV and Pattern Recognition
Lets computers understand pictures better, faster.