UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
By: Wenkang Han, Zhixiong Zeng, Jing Huang, and more
Potential Business Impact:
Lets computers control apps using your voice.
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.
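The abstract does not describe how the training-free two-step grounding refinement works. One plausible reading, shown here only as a sketch and not as the authors' method, is a crop-and-re-predict loop: take a coarse prediction on the full screenshot, crop a window around it, predict again on the crop, and map the result back to full-screen coordinates. The names `crop_box`, `two_step_ground`, the `predict` callable, and the `crop_frac` parameter are all hypothetical and do not come from the UITron-Speech code.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


def crop_box(center: Point, size: Tuple[int, int], crop_frac: float = 0.5) -> Box:
    """Return a crop window centred on `center`, shifted to stay on screen."""
    width, height = size
    cw, ch = int(width * crop_frac), int(height * crop_frac)
    # Clamp the top-left corner so the window never leaves the screenshot.
    left = min(max(int(center[0] - cw / 2), 0), width - cw)
    top = min(max(int(center[1] - ch / 2), 0), height - ch)
    return (left, top, left + cw, top + ch)


def two_step_ground(
    predict: Callable[[Box], Point],
    size: Tuple[int, int],
    crop_frac: float = 0.5,
) -> Point:
    """Coarse-to-fine grounding with no extra training (assumed procedure).

    `predict` is a hypothetical stand-in for the agent: given a screen region,
    it returns a click point in that region's local pixel coordinates.
    """
    # Step 1: coarse prediction on the full screenshot.
    coarse = predict((0, 0, size[0], size[1]))
    # Step 2: re-predict on a crop around the coarse point.
    box = crop_box(coarse, size, crop_frac)
    local = predict(box)
    # Map the crop-local point back to full-screen coordinates.
    return (box[0] + local[0], box[1] + local[1])
```

The sketch assumes that localization error shrinks when the model sees a smaller region at higher effective resolution. Whether that holds for UITron-Speech, and what crop size it uses, should be checked against the paper and the released code.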
Similar Papers
UItron: Foundational GUI Agent with Advanced Perception and Planning
CV and Pattern Recognition
Helps computers control phones and computers automatically.
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
CV and Pattern Recognition
Teaches computers to do tasks by watching videos.
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Human-Computer Interaction
Teaches computers to understand screen instructions better.