Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs
By: Amir Taherin, Juyi Lin, Arash Akbari, and more
Potential Business Impact:
Robot control models can run efficiently on low-power hardware.
Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet their performance scaling across model architectures and hardware platforms, as well as their associated power budgets, remain poorly understood. This work presents an evaluation of five representative VLA models -- spanning state-of-the-art baselines and two newly proposed architectures -- targeting edge and datacenter GPU platforms. Using the LIBERO benchmark, we measure accuracy alongside system-level metrics, including latency, throughput, and peak memory usage, under varying edge power constraints and high-performance datacenter GPU configurations. Our results identify distinct scaling trends: (1) architectural choices, such as action tokenization and model backbone size, strongly influence throughput and memory footprint; (2) power-constrained edge devices exhibit non-linear performance degradation, with some configurations matching or exceeding older datacenter GPUs; and (3) high-throughput variants can be achieved without significant accuracy loss. These findings provide actionable insights when selecting and optimizing VLAs across a range of deployment constraints. Our work challenges current assumptions about the superiority of datacenter hardware for robotic inference.
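The system-level metrics described above (latency, throughput, peak memory) can be sketched with a small profiling harness. This is a minimal, illustrative stand-in using only the Python standard library; the paper's actual benchmark harness, GPU-side timing, and the `profile_policy` helper name are assumptions, not the authors' code.

```python
import time
import tracemalloc

def profile_policy(policy, batch, iters=50, warmup=5):
    """Measure mean latency (s), throughput (calls/s), and peak
    Python-heap memory (bytes) for a callable policy.

    Illustrative only: real VLA profiling would time GPU kernels
    (e.g. CUDA events) and track device memory, not the Python heap.
    """
    for _ in range(warmup):  # warm-up runs amortize one-time setup costs
        policy(batch)
    tracemalloc.start()
    t0 = time.perf_counter()
    for _ in range(iters):
        policy(batch)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed / iters, iters / elapsed, peak

# Stand-in "policy": sums a fake observation vector.
latency, throughput, peak_mem = profile_policy(sum, list(range(10_000)))
```

A real evaluation would sweep this loop across power caps (e.g. via `nvidia-smi -pl` on supported GPUs) and batch sizes to expose the non-linear degradation trends the abstract reports.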
Similar Papers
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Robotics
Makes robots understand and do tasks faster.