Energy-Efficient Vision Transformer Inference for Edge-AI Deployment
By: Nursultan Amanzhol, Jurn-Gyu Park
Potential Business Impact:
Finds the best AI vision models while using less power.
The growing deployment of Vision Transformers (ViTs) on energy-constrained devices requires evaluation methods that go beyond accuracy alone. We present a two-stage pipeline for assessing ViT energy efficiency that combines device-agnostic model selection with device-related measurements. We benchmark 13 ViT models on ImageNet-1K and CIFAR-10, running inference on an NVIDIA Jetson TX2 (edge device) and an NVIDIA RTX 3050 (mobile GPU). The device-agnostic stage uses the NetScore metric for screening; the device-related stage ranks models with the Sustainable Accuracy Metric (SAM). Results show that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on the TX2 relative to a ViT baseline (e.g., SAM5=1.44 on TX2/CIFAR-10), while distilled models such as TinyViT-11M_Distilled excel on the mobile GPU (e.g., SAM5=1.72 on RTX 3050/CIFAR-10 and SAM5=0.76 on RTX 3050/ImageNet-1K).
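To make the two-stage idea concrete, below is a minimal sketch of the workflow: a device-agnostic screening pass using NetScore (Wong, 2018: 20·log10(accuracy^α / (params^β · MACs^γ)), with the commonly used coefficients α=2, β=0.5, γ=0.5 assumed here), followed by a device-related re-ranking of the shortlist using measured per-image energy. The paper's SAM formula is not reproduced; the `sam_proxy` function and all numeric values are hypothetical placeholders for illustration, not the authors' metric or results.

```python
# Sketch of a two-stage energy-aware model selection pipeline.
# Stage 1 (device-agnostic): screen candidates with NetScore.
# Stage 2 (device-related): re-rank survivors using energy measured on the target device.
# NOTE: model statistics and energy numbers below are illustrative placeholders only.
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    name: str
    top1_acc: float                     # top-1 accuracy, percent
    params_m: float                     # parameters, millions
    macs_g: float                       # multiply-accumulates, billions
    energy_mj: Optional[float] = None   # measured energy per image (mJ), stage 2 only

def netscore(c: Candidate, alpha: float = 2.0, beta: float = 0.5, gamma: float = 0.5) -> float:
    """Device-agnostic NetScore (higher is better), per Wong (2018)."""
    return 20.0 * math.log10(c.top1_acc ** alpha / (c.params_m ** beta * c.macs_g ** gamma))

def sam_proxy(c: Candidate) -> float:
    """Hypothetical stand-in for the device-related ranking stage:
    accuracy per joule on the target device (higher is better).
    This is NOT the paper's SAM formula."""
    return c.top1_acc / (c.energy_mj / 1000.0)

# Stage 1: keep the top half of the pool by NetScore (illustrative statistics).
pool = [
    Candidate("LeViT_Conv_192", 80.0, 10.9, 0.7),
    Candidate("TinyViT-11M_Distilled", 81.5, 11.0, 2.0),
    Candidate("ViT-B/16", 81.1, 86.6, 17.6),
    Candidate("DeiT-S", 79.8, 22.1, 4.6),
]
shortlist = sorted(pool, key=netscore, reverse=True)[: len(pool) // 2]

# Stage 2: attach per-image energy measured on the target device
# (placeholder values, not the paper's measurements) and re-rank.
measured_energy_mj = {"LeViT_Conv_192": 35.0, "TinyViT-11M_Distilled": 42.0}
for c in shortlist:
    c.energy_mj = measured_energy_mj.get(c.name, float("inf"))
ranking = sorted(shortlist, key=sam_proxy, reverse=True)
print([c.name for c in ranking])
```

The design point is that the cheap, device-agnostic stage prunes the candidate pool before any hardware is involved, so the expensive energy measurements in the device-related stage only need to be collected for the shortlisted models.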
Similar Papers
A Study on Inference Latency for Vision Transformers on Mobile Devices
CV and Pattern Recognition
Predicts how fast phone AI can see pictures.
LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons
Machine Learning (CS)
Makes smart cameras work faster and use less power.
Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection
CV and Pattern Recognition
Finds fake pictures made by computers.