iFlyBot-VLA Technical Report

Published: November 1, 2025 | arXiv ID: 2511.01914v1

By: Yuan Zhang, Chenyu Xue, Wenjie Xu, and more

Potential Business Impact:

Robots learn to do tasks by watching and listening.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
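To make the idea of "discrete action tokens obtained through frequency-domain transformations of continuous control signals" concrete, here is a minimal sketch. The abstract does not say which transform or quantizer is used, so this assumes a discrete cosine transform (DCT) followed by uniform rounding; the function names and the `scale` value are hypothetical and chosen only for illustration.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform, not confirmed by the abstract)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out

def idct(c):
    """Inverse of dct above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def tokenize(chunk, scale=50.0):
    """Map a chunk of continuous control values to integer tokens.

    `scale` is a hypothetical quantization resolution: larger values
    keep more precision but produce a larger token vocabulary.
    """
    return [round(scale * c) for c in dct(chunk)]

def detokenize(tokens, scale=50.0):
    """Recover an approximate control chunk from integer tokens."""
    return idct([t / scale for t in tokens])

# Example: one joint's trajectory over an 8-step action chunk.
chunk = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
tokens = tokenize(chunk)
recon = detokenize(tokens)
```

For a smooth trajectory like this one, most of the energy lands in the first few coefficients, so the high-frequency tokens are at or near zero. That property is what makes a frequency-domain representation compact enough to predict as a short token sequence; the reconstruction error is bounded by the rounding step.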

iFlyBot-VLM Technical Report

Robotics

Robots learn to move and act by seeing.

7 Nov 2025

VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

Robotics

Helps robots learn to grab new things.

16 Oct 2025

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Robotics

Robots learn to do tasks by watching and reading.

18 Aug 2025

Page Count

19 pages
