Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies
By: Shaibal Saha, Lanyu Xu
Potential Business Impact:
Makes smart computer vision work on small devices.
In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which rely on hierarchical feature extraction, ViTs treat images as sequences of patches and leverage self-attention mechanisms. However, their high computational complexity and memory demands pose significant challenges for deployment on resource-constrained edge devices. To address these limitations, extensive research has focused on model compression techniques and hardware-aware acceleration strategies. Nonetheless, a comprehensive review that systematically categorizes these techniques and their trade-offs in accuracy, efficiency, and hardware adaptability for edge deployment remains lacking. This survey bridges this gap by providing a structured analysis of model compression techniques, software tools for edge inference, and hardware acceleration strategies for ViTs. We discuss their impact on accuracy, efficiency, and hardware adaptability, highlighting key challenges and emerging research directions to advance ViT deployment on edge platforms, including graphics processing units (GPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs). The goal is to inspire further research with a contemporary guide on optimizing ViTs for efficient deployment on edge devices.
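To make the abstract's point about computational and memory cost concrete, the following is a rough back-of-the-envelope sketch and is not taken from the survey itself. It counts how many patches an image yields and estimates the multiply-accumulate operations and attention-map memory of a single self-attention layer. The function names are hypothetical, the cost model ignores softmax, biases, and the MLP block, and the example sizes (224-pixel input, 16-pixel patches, embedding width 768, 12 heads) are assumed to match the commonly cited ViT-Base/16 configuration.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_macs(num_tokens: int, embed_dim: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer.

    Assumed cost model (a simplification):
      - Q, K, V, and output projections: 4 * N * D^2
      - QK^T scores plus the attention-weighted sum of V: 2 * N^2 * D
    """
    projections = 4 * num_tokens * embed_dim * embed_dim
    token_mixing = 2 * num_tokens * num_tokens * embed_dim
    return projections + token_mixing


def attention_map_bytes(num_tokens: int, num_heads: int, bytes_per_value: int) -> int:
    """Memory for the N x N attention maps of one layer, for one image."""
    return num_heads * num_tokens * num_tokens * bytes_per_value


if __name__ == "__main__":
    embed_dim, num_heads, patch_size = 768, 12, 16  # assumed ViT-Base/16-like sizes
    for image_size in (224, 448):
        tokens = num_patches(image_size, patch_size) + 1  # +1 for a class token
        macs = attention_macs(tokens, embed_dim)
        fp32 = attention_map_bytes(tokens, num_heads, 4)
        int8 = attention_map_bytes(tokens, num_heads, 1)
        print(f"{image_size}px: {tokens} tokens, {macs:,} MACs/layer, "
              f"attention maps {fp32:,} B (fp32) vs {int8:,} B (int8)")
```

Under this simplified model, doubling the input resolution roughly quadruples the token count, so the token-mixing term grows about sixteenfold. That quadratic growth is one reason compression methods such as token pruning or lower-precision arithmetic are attractive on edge hardware.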
Similar Papers
Vision Transformers in Precision Agriculture: A Comprehensive Survey
CV and Pattern Recognition
Helps farmers spot sick plants faster.
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
CV and Pattern Recognition
Finds better pictures using smart computer vision.
Hands-on Evaluation of Visual Transformers for Object Recognition and Detection
CV and Pattern Recognition
Helps computers see the whole picture, not just parts.