VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation
By: Jinyue Bian, Zhaoxing Zhang, Zhengyu Liang, and more
Potential Business Impact:
Helps robots follow instructions from different camera views.
Vision-Language-Action (VLA) models can follow text instructions based on visual observations of the surrounding environment. This ability to map multimodal inputs to actions is learned by training the VLA model on extensive standard demonstrations. The visual observations, captured by third-person global and in-wrist local cameras, inevitably vary in number and perspective across different environments, resulting in significant differences in the visual features. This perspective heterogeneity constrains the generality of VLA models. To address this, we propose VLA-LPAF, a lightweight module that fosters the perspective adaptivity of VLA models using only 2D data. VLA-LPAF is finetuned on images from a single view and fuses additional multiview observations in the latent space, which effectively and efficiently bridges the gap caused by perspective inconsistency. We instantiate our VLA-LPAF framework with the VLA model RoboFlamingo to construct RoboFlamingo-LPAF. Experiments show that RoboFlamingo-LPAF achieves average task success rate improvements of around 8% on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark. We also demonstrate the view-adaptive characteristics of the proposed RoboFlamingo-LPAF through real-world tasks.
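The abstract describes fusing observations from additional cameras with a single anchor view in latent space. As a rough illustration of how such a lightweight fusion module could look, here is a minimal PyTorch sketch based on cross-attention; the class name, token shapes, and residual design are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of latent-space multi-view fusion.
# All names, shapes, and design choices are assumptions, not the authors' code.
import torch
import torch.nn as nn


class PerspectiveAdaptiveFusion(nn.Module):
    """Fuse visual tokens from a variable number of extra camera views into
    the latent space of a single 'anchor' view via cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, anchor_tokens: torch.Tensor,
                extra_view_tokens: torch.Tensor) -> torch.Tensor:
        # anchor_tokens:     (B, N, D) tokens from the view the VLA was finetuned on
        # extra_view_tokens: (B, M, D) tokens concatenated from any other cameras
        fused, _ = self.cross_attn(
            query=anchor_tokens, key=extra_view_tokens, value=extra_view_tokens
        )
        x = self.norm(anchor_tokens + fused)  # residual: keep anchor-view features
        return x + self.mlp(x)                # light feed-forward refinement


if __name__ == "__main__":
    B, N, M, D = 2, 64, 128, 512
    fusion = PerspectiveAdaptiveFusion(dim=D)
    anchor = torch.randn(B, N, D)   # e.g. in-wrist camera tokens
    others = torch.randn(B, M, D)   # tokens from third-person cameras
    print(fusion(anchor, others).shape)  # torch.Size([2, 64, 512])
```

Because the extra views enter only as keys and values of the cross-attention, the module tolerates a varying number of cameras at inference time, which matches the perspective-adaptive behavior the abstract claims.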
Similar Papers
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Robotics
Fixes robot vision for new camera angles and appearances.
FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction
Robotics
Robots learn to fix their own mistakes.
VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation
Robotics
Helps robots learn to grab new things.