How Auxiliary Reasoning Unleashes GUI Grounding in VLMs
By: Weiming Li, Yan Shao, Jing Yang, and more
Potential Business Impact:
Helps computers understand where things are on screens.
Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with it because they are not specifically optimized for the task. In this paper we identify a key gap: while VLMs exhibit significant latent grounding ability, as demonstrated by their performance under the Pointing Game evaluation, they underperform when asked to output explicit coordinates. To close this gap while bypassing the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By rendering explicit spatial cues such as axes, grids, and labeled intersections directly onto the input image, these methods enable VLMs to articulate their implicit spatial understanding. We evaluate the methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs, and the results show that they substantially improve GUI grounding performance.
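To make the auxiliary-reasoning idea concrete, here is a minimal sketch of one such spatial cue: drawing a labeled coordinate grid on a screenshot before passing it to a VLM. The grid spacing, colors, label format, and the function name overlay_grid are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumptions labeled): overlay a labeled coordinate grid
# on a GUI screenshot so a VLM can read off pixel coordinates instead of
# estimating them. Grid step, colors, and label format are illustrative.
from PIL import Image, ImageDraw

def overlay_grid(image_path: str, step: int = 100) -> Image.Image:
    """Return a copy of the screenshot with a labeled coordinate grid."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size

    # Vertical and horizontal grid lines every `step` pixels.
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)

    # Label each intersection with its pixel coordinates: these labels are
    # the explicit spatial cue the model can anchor its answer to.
    for x in range(0, w, step):
        for y in range(0, h, step):
            draw.text((x + 2, y + 2), f"({x},{y})", fill=(255, 0, 0))

    return img

# Usage: annotate the screenshot, then attach it to the VLM prompt
# in place of the raw image.
# annotated = overlay_grid("screenshot.png")
# annotated.save("screenshot_grid.png")
```

Because the cue is rendered into the image itself, this works zero-shot with any VLM that accepts image input; no fine-tuning or extra annotation is required.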
Similar Papers
Towards Understanding Visual Grounding in Visual Language Models
CV and Pattern Recognition
Helps computers understand what's in pictures.
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
CV and Pattern Recognition
Finds things in satellite pictures using words.
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
CV and Pattern Recognition
AI learns to see and think better.