Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
By: Pranav Guruprasad, Yangyue Wang, Sudipta Chowdhury, and more
Potential Business Impact:
Robots learn new tasks without extra training.
Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in procedurally out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art vision-language models (VLMs) and VLAs (including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST) on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperform other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering. We release our benchmark, evaluation framework, and findings to enable the assessment of future VLA models and identify critical areas for improvement in their application to out-of-distribution digital tasks.
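To make the zero-shot evaluation setup concrete, the sketch below shows a minimal rollout loop on a single Procgen task. This is not the authors' released MultiNet v0.2 framework; it only illustrates the general pattern of querying a frozen model for discrete actions in a procedurally generated environment. The `predict_action` and `evaluate_zero_shot` functions are hypothetical placeholders, and the random-action body stands in for a real VLM/VLA call.

```python
import gym
import numpy as np

def predict_action(observation: np.ndarray, instruction: str) -> int:
    """Hypothetical model call: map an RGB frame plus a task prompt to a
    discrete Procgen action index (0-14). Replace with a real VLM/VLA."""
    return int(np.random.randint(15))  # random baseline for illustration

def evaluate_zero_shot(env_name: str = "procgen:procgen-coinrun-v0",
                       episodes: int = 10) -> float:
    """Roll out the frozen model with no task-specific fine-tuning and
    report the mean episode return."""
    env = gym.make(env_name, distribution_mode="easy")
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            action = predict_action(obs, "Collect the coin.")
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    env.close()
    return float(np.mean(returns))

if __name__ == "__main__":
    print("Mean return over 10 episodes:", evaluate_zero_shot())
```

In this kind of setup, the choice of action representation (raw action indices vs. text labels vs. tokenized continuous actions) and the wording of the instruction prompt are exactly the factors the abstract identifies as dominating zero-shot performance.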
Similar Papers
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
Robotics
Robots learn to do more tasks with better instructions.
Benchmarking the Generality of Vision-Language-Action Models
Machine Learning (CS)
Tests if AI can do many different jobs.
Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
Robotics
Robots learn to do tasks better by watching and listening.