Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training
By: Yi Liu, Sukai Wang, Dafeng Wei, and more
General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior; conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled, quantitative assessment of this bottleneck, we introduce the Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark for robotic manipulation comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation.
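To make the idea of action tokenization concrete, here is a minimal, hypothetical sketch of discretizing continuous control into token sequences via uniform per-dimension binning. This is not the paper's FACT tokenizer (which is flow-matching based); the vocabulary size and action bounds below are illustrative assumptions only.

```python
import numpy as np

# Assumed constants for illustration (not from the paper):
N_BINS = 256            # token vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def tokenize(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer tokens in [0, N_BINS-1]."""
    clipped = np.clip(actions, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)              # rescale to [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Reconstruct continuous actions from the centers of the token bins."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

# A toy 2-step trajectory with 3 action dimensions:
traj = np.array([[0.1, -0.5, 0.99],
                 [0.0, 0.42, -1.0]])
tokens = tokenize(traj)
recon = detokenize(tokens)
# Reconstruction error is bounded by half a bin width:
assert np.max(np.abs(recon - traj)) <= (HIGH - LOW) / N_BINS
```

Simple binning like this trades reconstruction fidelity against vocabulary size, which is exactly the trade-off a learned tokenizer such as FACT aims to sidestep by preserving high-fidelity trajectories in a compact discrete code.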