Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation

Published: November 24, 2025 | arXiv ID: 2511.18950v1

By: Juntao Gao, Feiyang Ye, Jing Zhang and more

Potential Business Impact:

Helps robots see and act faster.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. Standard token pruning techniques can reduce this overhead, but because they are task-agnostic they struggle to preserve task-critical visual information. To preserve both the holistic context and the fine-grained details needed for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. The compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. Real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that the instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of the approach.
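
To make the idea of instruction-conditioned token compression concrete, here is a minimal NumPy sketch of one generic way it could work: a small set of queries is modulated by an instruction embedding and then pools the visual tokens through cross-attention. This is not the paper's STC or SRC implementation, which the abstract does not specify. The function name `compress_tokens`, the FiLM-style scale-and-shift modulation, the weight matrices, and all sizes are assumptions chosen purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(visual_tokens, instruction_emb, base_queries, w_scale, w_shift):
    """Compress N visual tokens into K tokens, conditioned on an instruction.

    Hypothetical helper for illustration only (not the paper's method).
    visual_tokens:    (N, D) patch features from a vision encoder
    instruction_emb:  (D,)   pooled embedding of the language instruction
    base_queries:     (K, D) query vectors, with K < N
    w_scale, w_shift: (D, D) assumed projection weights for the modulation
    """
    # FiLM-style modulation: the instruction scales and shifts every query.
    scale = 1.0 + instruction_emb @ w_scale    # (D,)
    shift = instruction_emb @ w_shift          # (D,)
    queries = base_queries * scale + shift     # (K, D)

    # Scaled dot-product cross-attention: each query pools over all visual tokens.
    d = visual_tokens.shape[-1]
    attn = softmax(queries @ visual_tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ visual_tokens, attn          # (K, D), (K, N)

# Arbitrary illustrative sizes and random inputs; nothing here comes from the paper.
rng = np.random.default_rng(0)
N, K, D = 256, 64, 32
visual = rng.standard_normal((N, D))
instr = rng.standard_normal(D)
queries = rng.standard_normal((K, D))
w_scale = rng.standard_normal((D, D)) * 0.1
w_shift = rng.standard_normal((D, D)) * 0.1

compressed, attn = compress_tokens(visual, instr, queries, w_scale, w_shift)
print(compressed.shape)  # (64, 32)
```

In this sketch the downstream model would receive K tokens instead of N, and changing the instruction embedding changes which visual tokens each query attends to. In a trained system the queries and projection weights would be learned rather than sampled at random.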

Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

Robotics

Makes robots think and act much faster.

10 Dec 2025 1

91%

SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

CV and Pattern Recognition

Helps robots understand and do tasks better.

13 Nov 2025 2

91%

VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

CV and Pattern Recognition

Makes robots see and act faster.

20 Nov 2025 1

Page Count

11 pages
