OmniVLA: Unifying Multi-Sensor Perception for Physically-Grounded Multimodal VLA
By: Heyu Guo, Shanmu Wang, Ruichun Ma, and more
Potential Business Impact:
Robots see and hear better to do tasks.
Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded, physically meaningful masks onto RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this representation, we present a multisensory vision-language-action model architecture and train the model from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, while showing higher learning efficiency and stronger generalization capability.
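To make the sensor-masked image idea concrete, the following minimal Python sketch shows how a spatially grounded mask from a non-RGB sensor could be overlaid onto an RGB frame so the fused input stays close to RGB statistics. All function names, parameters, and the blending scheme here are illustrative assumptions, not the authors' implementation.

import numpy as np


def sensor_masked_image(rgb: np.ndarray,
                        sensor_mask: np.ndarray,
                        color: tuple = (255, 0, 0),
                        alpha: float = 0.5) -> np.ndarray:
    """Overlay a binary sensor mask (H, W) onto an RGB image (H, W, 3).

    rgb         : uint8 RGB frame from the camera.
    sensor_mask : boolean mask already projected into the camera frame,
                  e.g. a thermal hot region, a radar detection footprint,
                  or a sound-source direction cone (assumed precomputed).
    color       : highlight color used for the overlay.
    alpha       : blending weight; partial blending keeps the result
                  close to RGB statistics.
    """
    out = rgb.astype(np.float32)
    overlay = np.zeros_like(out)
    overlay[..., 0], overlay[..., 1], overlay[..., 2] = color
    # Blend only where the sensor reports a physically grounded detection.
    m = sensor_mask.astype(bool)
    out[m] = (1.0 - alpha) * out[m] + alpha * overlay[m]
    return out.astype(np.uint8)


# Usage sketch: one masked image per sensor, each passed through a lightweight
# per-sensor projector before the VLA backbone (projector details assumed).
rgb = np.zeros((224, 224, 3), dtype=np.uint8)
thermal_mask = np.zeros((224, 224), dtype=bool)
thermal_mask[80:140, 80:140] = True  # e.g. a hot object seen by the IR camera
fused = sensor_masked_image(rgb, thermal_mask)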
Similar Papers
OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation
CV and Pattern Recognition
Robots see and hear better to do more tasks.
OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation
Robotics
Robot learns to go anywhere from words or pictures.
DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
CV and Pattern Recognition
Helps robots understand where things are better.