InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
By: Yang Tian, Yuyin Yang, Yiman Xie and more
Potential Business Impact:
Makes robots learn from fake experiences.
Recent work explores how real and synthetic data contribute to the generalization of Vision-Language-Action (VLA) models. While current VLA models have demonstrated the effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $π$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprising zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $π_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $π_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.
Similar Papers
A tutorial note on collecting simulated data for vision-language-action models
Robotics
Robots learn tasks from seeing, hearing, and doing.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
Robotics
Robots learn to grab anything from fake practice.
$π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Machine Learning (CS)
Robots learn to clean new homes by watching and listening.