Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

Published: December 10, 2025 | arXiv ID: 2512.09927v1

By: Yifan Ye, Jiaqi Ma, Jun Cen and more

Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment: inference is computationally expensive, and latency is critical in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA.
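To make the expand-then-merge idea concrete, the following is a minimal sketch and not the TEAM-VLA implementation. The abstract does not specify the selection rule, the neighbourhood shape, or how action-aware guidance works, so everything here is an assumption made for illustration: the function names (`top_k_indices`, `expand_selection`, `merge_tokens`), the square neighbourhood of radius 1, and the greedy cosine-similarity merge, which stands in for the paper's action-aware merging. The sketch uses plain Python lists on a small patch grid.

```python
import math


def top_k_indices(scores, k):
    """Return indices of the k highest scores (ties go to the lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def expand_selection(selected, height, width, radius=1):
    """Add every grid neighbour within `radius` of a selected token.

    Tokens are indexed row-major on a height x width patch grid.
    The square neighbourhood is an assumption, not taken from the paper.
    """
    expanded = set(selected)
    for idx in selected:
        row, col = divmod(idx, width)
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                r, c = row + dr, col + dc
                if 0 <= r < height and 0 <= c < width:
                    expanded.add(r * width + c)
    return sorted(expanded)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, target):
    """Greedily average the most similar pair until `target` tokens remain.

    This similarity-based merge is a stand-in; the paper's action-aware
    guidance is not described in the abstract and is not modelled here.
    """
    merged = [list(t) for t in tokens]
    while len(merged) > max(target, 1):
        best = None
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                sim = cosine(merged[i], merged[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged[i] = [(x + y) / 2 for x, y in zip(merged[i], merged[j])]
        del merged[j]
    return merged


if __name__ == "__main__":
    height, width = 4, 4
    attention = [0.0] * (height * width)
    attention[5] = 1.0  # one highlighted patch at row 1, column 1
    seeds = top_k_indices(attention, k=1)
    kept = expand_selection(seeds, height, width)
    print("seed tokens:", seeds)
    print("after expansion:", kept)
```

In this toy setup, expansion grows a single attention peak into its surrounding patches, and merging would then shrink the kept set back toward a token budget in a later layer. The pairwise search in `merge_tokens` is quadratic per step, which is acceptable for a sketch but not for real token counts.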

Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation

Robotics

Helps robots see and act faster.

24 Nov 2025

91%

EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

CV and Pattern Recognition

Makes robots learn tasks much faster.

11 Jun 2025

90%

MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

Robotics

Robots learn many tasks in one brain.

24 Nov 2025
