Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation
By: Zhiyuan Hu, Shiyun Xiong, Yifan Zhang, and more
Potential Business Impact:
Helps computers control apps better by learning from mistakes.
Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy in static environments, along with an approximately 33% increase in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we demonstrate even greater improvements in task success.
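To make the abstract's mechanism concrete, here is a minimal sketch of process-reward-guided action selection at inference time, combined with the trajectory reflection and retry loop the abstract mentions. This is not the authors' implementation: the objects `vlm`, `prm`, and `env`, and their methods (`generate`, `score`, `reflect`, `reset`, `step`), are hypothetical stand-ins assuming a VLM agent that can sample candidate actions, a process reward model that scores state-action pairs, and a GUI environment interface.

```python
# Sketch (assumed interfaces, not the paper's code): at each step the VLM
# proposes several candidate actions, a process reward model (PRM) scores
# each candidate, and the agent executes the highest-scoring one. Failed
# trajectories trigger a reflection that is appended to the goal on retry.

from typing import List, Tuple

def propose_actions(vlm, screenshot, goal: str, history: List[str], k: int = 5) -> List[str]:
    """Sample k candidate actions from the VLM for the current GUI state."""
    return [vlm.generate(screenshot, goal, history) for _ in range(k)]

def best_action(prm, screenshot, goal: str, history: List[str], candidates: List[str]) -> Tuple[str, float]:
    """Score each candidate with the process reward model; return the argmax."""
    scored = [(a, prm.score(screenshot, goal, history, a)) for a in candidates]
    return max(scored, key=lambda pair: pair[1])

def run_episode(vlm, prm, env, goal: str, max_steps: int = 30, max_retries: int = 2) -> bool:
    """Step-level PRM guidance, wrapped in trajectory-level reflection and retry."""
    reflection = ""
    for attempt in range(max_retries + 1):
        history: List[str] = []
        obs = env.reset()
        success = False
        for _ in range(max_steps):
            candidates = propose_actions(vlm, obs, goal + reflection, history)
            action, _ = best_action(prm, obs, goal + reflection, history, candidates)
            obs, done, success = env.step(action)  # assumed environment API
            history.append(action)
            if done:
                break
        if success:
            return True
        # Reflect on the failed trajectory; retry with the critique appended to the goal.
        reflection = "\n" + vlm.reflect(goal, history)
    return False
```

Because the reward model scores each step rather than the whole trajectory, feedback arrives before errors compound, which is the advantage the abstract claims over trajectory-level evaluation and refinement.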
Similar Papers
GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
CV and Pattern Recognition
Lets AI learn to use computer programs by itself.
A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning
Artificial Intelligence
Helps computers understand and use apps like people.
SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
Machine Learning (CS)
Helps robots see and follow directions better.