Score: 1

MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions

Published: August 12, 2025 | arXiv ID: 2508.09057v2

By: Zeyu Huang, Juyuan Wang, Longfeng Chen and more

Potential Business Impact:

Helps mobile agents on phones understand and safely carry out complex or ambiguous user tasks.

Given the significant advances of Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address users' diverse and complex requirements. From our extensive collection of user questionnaires, we identified five instruction categories: Multi-App, Vague, Interactive, Single-App, and Unethical. Around these categories, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompter to mitigate risks and clarify user intent for mobile agents. Aider is easy to integrate into several frameworks and improves the overall success rate by 19.55% over the current state of the art (SOTA) on MVISU-Bench; in particular, it achieves success-rate improvements of 53.52% and 29.41% on unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
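The abstract describes Aider only at a high level: a plug-and-play module sitting between the user and the mobile agent that refuses unethical requests, asks for clarification when the instruction is vague or interactive, and otherwise augments the agent's prompt. The sketch below is a hypothetical illustration of that idea under those assumptions, not the paper's implementation; names such as AiderModule, InstructionType, and ask_user are invented here, and the classifier and agent objects are assumed to be supplied by the surrounding framework.

```python
# Hypothetical sketch of a plug-and-play "dynamic prompter" wrapper.
# All class and function names are illustrative, not from the paper.
from enum import Enum


class InstructionType(Enum):
    SINGLE_APP = "single_app"
    MULTI_APP = "multi_app"
    VAGUE = "vague"
    INTERACTIVE = "interactive"
    UNETHICAL = "unethical"


def ask_user(question: str) -> str:
    # Placeholder for an interactive clarification channel with the user.
    return input(question + " ")


class AiderModule:
    """Wraps a mobile agent: classifies each instruction, then refuses,
    asks the user for clarification, or augments the prompt and delegates."""

    def __init__(self, classifier, agent):
        self.classifier = classifier  # callable(instruction, screenshot) -> InstructionType
        self.agent = agent            # underlying mobile agent with an execute() method

    def run(self, instruction: str, screenshot) -> str:
        kind = self.classifier(instruction, screenshot)

        if kind == InstructionType.UNETHICAL:
            # Mitigate risk: stop before any on-device action is taken.
            return "Refused: this request appears to violate usage policy."

        if kind in (InstructionType.VAGUE, InstructionType.INTERACTIVE):
            # Clarify user intent with an extra dialogue turn before acting.
            answer = ask_user(f"Before I proceed, could you clarify: '{instruction}'?")
            instruction = f"{instruction}\n[User clarification]: {answer}"

        # Single-app and multi-app instructions pass through with the
        # predicted category injected into the agent's prompt.
        prompt = f"Task type: {kind.value}. Instruction: {instruction}"
        return self.agent.execute(prompt, screenshot)
```

In practice the classifier would likely itself be an LVLM call prompted to label the instruction, which is what would make the module "plug-and-play" across different agent frameworks.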

Country of Origin
🇨🇳 China

Page Count
17 pages

Category
Computer Science:
Computation and Language