MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements
By: SeokJoo Kwak, Jihoon Kim, Boyoun Kim, and more
Potential Business Impact:
Helps computers understand screen instructions better.
Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
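The abstract outlines the pipeline at a high level only, so the sketch below is a minimal illustration of how the two stages might fit together: a coarse ROI stage with a bidirectional zoom, followed by instruction rewriting and fine-grained grounding inside the ROI. All names here (propose_roi, target_visible, rewrite, locate, BBox) and the PIL-style image interface are assumptions for illustration, not the authors' API; the actual implementation is in the linked repository.

```python
# Illustrative sketch of a two-stage grounding pipeline in the spirit of MEGA-GUI.
# All agent callables and helper names are hypothetical, not the paper's real API.

from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[int, int]


@dataclass
class BBox:
    left: int
    top: int
    right: int
    bottom: int

    def scaled(self, factor: float, bounds: "BBox") -> "BBox":
        """Grow (factor > 1) or shrink (factor < 1) around the center, clamped to bounds."""
        cx, cy = (self.left + self.right) / 2, (self.top + self.bottom) / 2
        w, h = (self.right - self.left) * factor, (self.bottom - self.top) * factor
        return BBox(
            left=max(bounds.left, int(cx - w / 2)),
            top=max(bounds.top, int(cy - h / 2)),
            right=min(bounds.right, int(cx + w / 2)),
            bottom=min(bounds.bottom, int(cy + h / 2)),
        )


def ground(
    screenshot,                 # full-resolution screenshot (e.g. a PIL image)
    instruction: str,
    propose_roi: Callable,      # hypothetical VLM agent: (image, instruction) -> BBox
    target_visible: Callable,   # hypothetical VLM agent: (crop, instruction) -> bool
    rewrite: Callable,          # hypothetical VLM agent: (crop, instruction) -> str
    locate: Callable,           # hypothetical VLM agent: (crop, instruction) -> Point
    max_zoom_steps: int = 3,
) -> Point:
    """Return full-screen coordinates for the element described by `instruction`."""
    full = BBox(0, 0, screenshot.width, screenshot.height)

    # Stage 1: coarse ROI selection with a bidirectional zoom.
    roi = propose_roi(screenshot, instruction)
    for _ in range(max_zoom_steps):
        crop = screenshot.crop((roi.left, roi.top, roi.right, roi.bottom))
        if target_visible(crop, instruction):
            # Zoom in to reduce spatial dilution from surrounding clutter.
            roi = roi.scaled(0.8, full)
        else:
            # Zoom out in case the target fell outside the proposed region.
            roi = roi.scaled(1.5, full)

    # Stage 2: context-aware rewriting, then fine-grained grounding inside the ROI.
    crop = screenshot.crop((roi.left, roi.top, roi.right, roi.bottom))
    refined = rewrite(crop, instruction)   # disambiguate using local visual context
    x, y = locate(crop, refined)           # coordinates relative to the crop
    return roi.left + x, roi.top + y       # map back to full-screen coordinates
```

In this sketch the zoom direction is decided by a visibility check: shrinking the ROI when the target is already inside it, expanding it otherwise. The paper's actual bidirectional zoom and agent orchestration may differ.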
Similar Papers
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
Artificial Intelligence
Helps computers understand screen instructions better.
GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
CV and Pattern Recognition
Lets computers understand where to click on screens.