DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents
By: Yibin Xu, Liang Yang, Hao Chen and more
Potential Business Impact:
Teaches computers to understand what's on your screen.
The limited availability of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents, especially for desktop and computer-use scenarios. To address this, we propose an automated GUI data generation pipeline, AutoCaptioner, which generates data with rich descriptions while minimizing human effort. Using AutoCaptioner, we created a novel large-scale desktop GUI dataset, DeskVision, along with the largest desktop test benchmark, DeskVision-Eval, which reflects daily usage and covers diverse systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding and grounding visual elements without the need for complex architectural designs. We further validate the effectiveness of the DeskVision dataset through ablation studies on various large visual language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and we will open-source them for the community.
Similar Papers
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
CV and Pattern Recognition
Helps computers learn to use programs like people.
AUTO-Explorer: Automated Data Collection for GUI Agent
Artificial Intelligence
Teaches computers to understand new apps quickly.
Using GUI Agent for Electronic Design Automation
CV and Pattern Recognition
Automates complex computer design tasks, beating experts.