GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
By: Yuchen Sun, Shanhui Zhao, Tao Yu and more
Potential Business Impact:
Helps computers learn to use any app.
GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.