IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction
By: Yandu Chen, Kefan Gu, Yuqing Wen, and more
Potential Business Impact:
Robots understand what you want without you saying it.
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current state-of-the-art VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios and then finetuned to map explicit instructions to actions. Lacking both reasoning-intensive pretraining and reasoning-guided manipulation, these models cannot perform the implicit human-intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the subsequent finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms π0, achieving 18% higher success rates under direct instructions, and exceeds ECoT by 28% under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines and further enables zero-shot human-robot interaction with a 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.
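To make the two-stage idea in the abstract concrete, here is a minimal, hypothetical sketch of the inference flow it describes: the reasoning-pretrained VLM first produces a compact reasoning trace (inferred intention plus grounding), and the action model then conditions on that trace when generating controls. All class and function names below (Observation, CompactReasoner, ActionHead, intention_vla_step) are illustrative placeholders, not the authors' released API.

```python
# Hypothetical sketch of IntentionVLA-style two-stage inference, as described
# in the abstract: compact reasoning first, then reasoning-conditioned action
# generation. Names and outputs are placeholders, not the paper's code.

from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    image_path: str      # current camera frame from the robot
    instruction: str     # possibly indirect, e.g. "I'm thirsty"


class CompactReasoner:
    """Stands in for the VLM backbone after the reasoning curriculum."""

    def infer_intention(self, obs: Observation) -> str:
        # In the real system this would be a VLM forward pass emitting a short
        # trace: inferred intention, spatial grounding, and a compact plan.
        return "intent: fetch a drink; target: cup on table; plan: grasp cup"


class ActionHead:
    """Stands in for the action decoder finetuned with reasoning as context."""

    def generate(self, obs: Observation, reasoning: str) -> List[float]:
        # The compact reasoning string is supplied as context so the policy is
        # conditioned on the inferred intention rather than the raw instruction.
        _ = (obs, reasoning)
        return [0.0] * 7  # placeholder 7-DoF end-effector action


def intention_vla_step(obs: Observation,
                       reasoner: CompactReasoner,
                       policy: ActionHead) -> List[float]:
    """One control step: reason once, then act conditioned on that reasoning."""
    reasoning = reasoner.infer_intention(obs)
    return policy.generate(obs, reasoning)


if __name__ == "__main__":
    obs = Observation(image_path="frame.png", instruction="I'm thirsty")
    print(intention_vla_step(obs, CompactReasoner(), ActionHead()))
```

Keeping the reasoning output short is what allows the second stage to stay fast at inference time, since only a compact trace, not a long chain of thought, is prepended to the action query.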
Similar Papers
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
CV and Pattern Recognition
Teaches robots to act and think better.
INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM
Robotics
Robots learn to do new tasks by watching and remembering.
Mind to Hand: Purposeful Robotic Control via Embodied Reasoning
Robotics
Robots learn to do tasks by watching and thinking.