Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs
By: Amir Taherin, Juyi Lin, Arash Akbari, and more
Potential Business Impact:
Robot control models can run efficiently on low-power hardware.
Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet their performance scaling across model architectures and hardware platforms, as well as their associated power budgets, remain poorly understood. This work presents an evaluation of five representative VLA models -- spanning state-of-the-art baselines and two newly proposed architectures -- targeting edge and datacenter GPU platforms. Using the LIBERO benchmark, we measure accuracy alongside system-level metrics, including latency, throughput, and peak memory usage, under varying edge power constraints and high-performance datacenter GPU configurations. Our results identify distinct scaling trends: (1) architectural choices, such as action tokenization and model backbone size, strongly influence throughput and memory footprint; (2) power-constrained edge devices exhibit non-linear performance degradation, with some configurations matching or exceeding older datacenter GPUs; and (3) high-throughput variants can be achieved without significant accuracy loss. These findings provide actionable insights when selecting and optimizing VLAs across a range of deployment constraints. Our work challenges current assumptions about the superiority of datacenter hardware for robotic inference.
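The system-level metrics described above (latency, throughput, peak memory) can be sketched with a small profiling harness. This is a minimal, illustrative stand-in using only the Python standard library; the paper's actual benchmark harness, GPU-side timing, and the `profile_policy` helper name are assumptions, not the authors' code.

```python
import time
import tracemalloc

def profile_policy(policy, batch, iters=50, warmup=5):
    """Measure mean latency (s), throughput (calls/s), and peak
    Python-heap memory (bytes) for a callable policy.

    Illustrative only: real VLA profiling would time GPU kernels
    (e.g. CUDA events) and track device memory, not the Python heap.
    """
    for _ in range(warmup):  # warm-up runs amortize one-time setup costs
        policy(batch)
    tracemalloc.start()
    t0 = time.perf_counter()
    for _ in range(iters):
        policy(batch)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed / iters, iters / elapsed, peak

# Stand-in "policy": sums a fake observation vector.
latency, throughput, peak_mem = profile_policy(sum, list(range(10_000)))
```

A real evaluation would sweep this loop across power caps (e.g. via `nvidia-smi -pl` on supported GPUs) and batch sizes to expose the non-linear degradation trends the abstract reports.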
Similar Papers
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Robotics
Makes robots understand and do tasks faster.