Structuring GUI Elements through Vision Language Models: Towards Action Space Generation

Published: August 22, 2025 | arXiv ID: 2508.16271v2

By: Yi Xu, Yesheng Zhang, Jiajia Liu, and more

Potential Business Impact:

Helps computers understand on-screen buttons and menus.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Multimodal large language models (MLLMs) have emerged as pivotal tools for enhancing human-computer interaction. In this paper, we focus on applying MLLMs to graphical user interface (GUI) element structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling that augments the training data with samples chosen for their proximity to the ground-truth coordinates. This augmented data is then used to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of IAML over traditional training paradigms.
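The abstract describes two moving parts: sampling near-miss coordinate boxes by IoU with the ground truth, and using those samples so that close-but-inexact predictions earn partial credit instead of the all-or-nothing signal of plain maximum likelihood. The sketch below is a minimal, hypothetical Python illustration of that idea; the paper's actual sampling distribution, thresholds, and loss form are not given here, so the function names, the Gaussian corner jitter, and the IoU-weighted log-likelihood are illustrative assumptions.

```python
import random

def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_near_miss_boxes(gt_box, num_samples=8, jitter=0.05, min_iou=0.5, seed=0):
    """Sample coordinate boxes near the ground truth, keeping high-IoU ones.

    Each corner is jittered by Gaussian noise scaled to the box size
    (an assumed sampling scheme); candidates below `min_iou` are rejected,
    so every kept sample is a plausible near miss, not an arbitrary box.
    """
    rng = random.Random(seed)
    w, h = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    kept = []
    while len(kept) < num_samples:
        cand = (
            gt_box[0] + rng.gauss(0, jitter * w),
            gt_box[1] + rng.gauss(0, jitter * h),
            gt_box[2] + rng.gauss(0, jitter * w),
            gt_box[3] + rng.gauss(0, jitter * h),
        )
        score = iou(gt_box, cand)
        if score >= min_iou:
            kept.append((cand, score))
    return kept

def iaml_loss(token_log_probs, iou_score):
    """IoU-weighted negative log-likelihood for one sampled coordinate string.

    Plain MLE rewards only the exact ground-truth tokens; scaling the
    sequence log-likelihood by the sample's IoU (one plausible reading of
    "IoU-augmented" MLE) lets near misses contribute a softer signal.
    """
    return -iou_score * sum(token_log_probs)

# Example: augment one hypothetical button's bounding box.
gt = (120.0, 40.0, 260.0, 88.0)
for box, score in sample_near_miss_boxes(gt, num_samples=3):
    print(f"IoU={score:.2f}  box={tuple(round(v, 1) for v in box)}")
```

Rejection sampling against a minimum IoU is one simple way to realize "proximity to ground truth" from the abstract; whatever the paper's concrete pipeline, the key design point is that the supervision signal becomes graded by spatial overlap rather than binary token match.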

Country of Origin
🇨🇳 China

Page Count
10 pages

Category
Computer Science:
Computer Vision and Pattern Recognition