Benchmarking the Generality of Vision-Language-Action Models
By: Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, and more
Generalist multimodal agents are expected to unify perception, language, and control, operating robustly across diverse real-world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluating GPT-5, Pi0, and Magma, we find that no model demonstrates consistent generality: despite strong performance within their training distributions, all exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts. These failures manifest as modality misalignment, output-format instability, and catastrophic knowledge degradation under domain transfer. Our findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation models. MultiNet v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist agents. Code, data, and leaderboards are publicly available.