ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
By: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, and more
Potential Business Impact:
Robots learn to plan and fix mistakes.
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models end-to-end, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans, guided by reinforcement learning with action-aligned visual rewards derived from goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution in target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
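The abstract describes a dual-system pipeline: a reasoning module produces a plan, the plan is compressed into a fixed-size visual plan latent, and that latent conditions a low-level action policy. The sketch below is a minimal, hypothetical illustration of that wiring in PyTorch; the module names (LatentCompressor, ActionPolicy, plan_reward) and all architectural details are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the dual-system idea (hypothetical names, not the ThinkAct code):
# a high-level reasoning module produces plan embeddings, which are compressed into a
# fixed-size visual plan latent that conditions a low-level action policy.

import torch
import torch.nn as nn


class LatentCompressor(nn.Module):
    """Compress a variable-length sequence of plan embeddings into one latent vector."""

    def __init__(self, dim: int, latent_dim: int):
        super().__init__()
        self.attn_pool = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query
        self.proj = nn.Linear(dim, latent_dim)

    def forward(self, plan_tokens: torch.Tensor) -> torch.Tensor:
        # plan_tokens: (batch, seq_len, dim) -> latent: (batch, latent_dim)
        q = self.query.expand(plan_tokens.size(0), -1, -1)
        pooled, _ = self.attn_pool(q, plan_tokens, plan_tokens)
        return self.proj(pooled.squeeze(1))


class ActionPolicy(nn.Module):
    """Low-level controller conditioned on the plan latent and current observation."""

    def __init__(self, obs_dim: int, latent_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs: torch.Tensor, plan_latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, plan_latent], dim=-1))


def plan_reward(goal_score: torch.Tensor, traj_consistency: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    # Hypothetical scalarization of the two reward signals named in the abstract:
    # goal completion and trajectory consistency.
    return alpha * goal_score + (1.0 - alpha) * traj_consistency


if __name__ == "__main__":
    batch, seq_len, dim, latent_dim, obs_dim, act_dim = 2, 16, 64, 32, 10, 7
    plan_tokens = torch.randn(batch, seq_len, dim)  # stand-in for MLLM plan embeddings
    obs = torch.randn(batch, obs_dim)               # stand-in for a robot observation

    compressor = LatentCompressor(dim, latent_dim)
    policy = ActionPolicy(obs_dim, latent_dim, act_dim)

    latent = compressor(plan_tokens)                # visual plan latent
    action = policy(obs, latent)                    # latent-conditioned action prediction
    print(action.shape)                             # torch.Size([2, 7])
```

In this reading, the reward only shapes the high-level planner's outputs, while the action policy is trained separately to execute whatever latent it is given; this separation is what the paper's dual-system framing suggests, though the exact training procedure is specified only in the paper itself.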
Similar Papers
Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars drive smarter and faster.
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
CV and Pattern Recognition
Teaches robots to act and think better.