Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models
By: Xudong Tan, Yaoxin Yang, Peng Ye, and more
Potential Business Impact:
Makes robots follow instructions much faster.
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, which stems from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an information-guided visual token selection strategy that prunes low-contribution tokens. Extensive experiments on the LIBERO benchmark show that FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. These results demonstrate the effectiveness of FlashVLA in enabling lightweight, low-latency VLA inference without retraining.
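The abstract describes two mechanisms: reusing actions across near-identical consecutive steps, and pruning low-contribution visual tokens. The sketch below illustrates the general idea only; it is not the paper's implementation. The cosine-similarity gate, the pooled step feature, the attention-based token score, the thresholds, and all model methods (encode_image, text_to_visual_attention, decode_action) are assumptions made for illustration, written in PyTorch-style Python.

```python
# Minimal sketch of action reuse + visual token pruning, assuming a
# PyTorch-style VLA pipeline. All thresholds and model methods here are
# illustrative assumptions, not the FlashVLA API.
import torch
import torch.nn.functional as F


def should_reuse_action(prev_feat: torch.Tensor,
                        curr_feat: torch.Tensor,
                        reuse_threshold: float = 0.98) -> bool:
    """Reuse the previous action when consecutive step features are highly similar."""
    sim = F.cosine_similarity(prev_feat.flatten(), curr_feat.flatten(), dim=0)
    return sim.item() >= reuse_threshold


def select_visual_tokens(visual_tokens: torch.Tensor,
                         attn_to_visual: torch.Tensor,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by an information score.

    visual_tokens:  (N, D) visual token features.
    attn_to_visual: (T, N) attention from text tokens to visual tokens,
                    used here as an assumed proxy for token contribution.
    """
    num_keep = max(1, int(keep_ratio * visual_tokens.shape[0]))
    scores = attn_to_visual.mean(dim=0)                 # one score per visual token
    keep_idx = scores.topk(num_keep).indices.sort().values
    return visual_tokens[keep_idx]


def control_loop(model, observations, reuse_threshold=0.98, keep_ratio=0.5):
    """Decode a new action only when the reuse check fails."""
    prev_action, prev_feat = None, None
    for obs in observations:
        visual_tokens = model.encode_image(obs)          # hypothetical: (N, D) token features
        feat = visual_tokens.mean(dim=0)                 # pooled step feature (assumed proxy)
        if prev_feat is not None and should_reuse_action(prev_feat, feat, reuse_threshold):
            action = prev_action                         # skip autoregressive decoding entirely
        else:
            attn = model.text_to_visual_attention(obs)   # hypothetical: (T, N) attention map
            kept = select_visual_tokens(visual_tokens, attn, keep_ratio)
            action = model.decode_action(kept)           # hypothetical decoder call
        prev_action, prev_feat = action, feat
        yield action
```

A cosine-similarity gate on a pooled feature is one plausible proxy for detecting "stable action steps"; the paper's actual token-aware reuse criterion and information-guided selection score may be defined differently.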
Similar Papers
EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
CV and Pattern Recognition
Makes robots learn tasks much faster.
VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching
Robotics
Makes robots react faster by remembering what they see.