Evaluating the Energy Efficiency of NPU-Accelerated Machine Learning Inference on Embedded Microcontrollers
By: Anastasios Fanariotis, Theofanis Orphanoudakis, Vasilis Fotopoulos
Potential Business Impact:
Enables battery-powered microcontroller devices to run machine-learning inference far faster and at a fraction of the energy cost.
The deployment of machine learning (ML) models on microcontrollers (MCUs) is constrained by strict energy, latency, and memory requirements, particularly in battery-operated and real-time edge devices. While software-level optimizations such as quantization and pruning reduce model size and computation, hardware acceleration has emerged as a decisive enabler for efficient embedded inference. This paper evaluates the impact of Neural Processing Units (NPUs) on MCU-based ML execution, using the ARM Cortex-M55 core combined with the Ethos-U55 NPU on the Alif Semiconductor Ensemble E7 development board as a representative platform. A rigorous measurement methodology was employed, incorporating per-inference net energy accounting via GPIO-triggered high-resolution digital multimeter synchronization and idle-state subtraction, ensuring accurate attribution of energy costs. Experimental results across six representative ML models (MiniResNet, MobileNetV2, FD-MobileNet, MNIST, TinyYolo, and SSD-MobileNet) demonstrate substantial efficiency gains when inference is offloaded to the NPU. For moderate to large networks, latency improvements ranged from 7x to over 125x, with per-inference net energy reductions up to 143x. Notably, the NPU enabled execution of models unsupported on CPU-only paths, such as SSD-MobileNet, highlighting its functional as well as efficiency advantages. These findings establish NPUs as a cornerstone of energy-aware embedded AI, enabling real-time, power-constrained ML inference at the MCU level.
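To make the measurement methodology concrete, the sketch below shows, in plain C, the general pattern the abstract describes: a marker GPIO is asserted around a batch of inferences so an external high-resolution digital multimeter can be synchronized to the active window, and the measured idle power is subtracted so that only the incremental energy is attributed to inference. Every identifier and constant here (marker_gpio_set, run_inference, the example energy and power values) is a hypothetical placeholder, not the authors' actual firmware, instrument configuration, or measured data.

/*
 * Illustrative sketch (not the authors' firmware): per-inference net energy
 * accounting using a GPIO marker for external DMM synchronization and
 * idle-state subtraction.  All names and constants are placeholders.
 */
#include <stdint.h>
#include <stdio.h>

#define N_INFERENCES 1000u  /* back-to-back inferences inside the marked window */

/* Hypothetical board-support hooks; a real port would call the vendor HAL and
 * the deployed runtime (e.g. TFLite Micro with the Ethos-U driver).          */
static void marker_gpio_set(int level) { (void)level; /* drive trigger pin   */ }
static void run_inference(void)        { /* invoke the deployed model once   */ }
static uint32_t millis(void)           { static uint32_t t; return t += 5;     }

int main(void)
{
    /* Assert the marker pin so the DMM log is aligned with the active window. */
    marker_gpio_set(1);
    uint32_t t0 = millis();

    for (uint32_t i = 0; i < N_INFERENCES; i++) {
        run_inference();
    }

    uint32_t t_active_ms = millis() - t0;
    marker_gpio_set(0);  /* close the measurement window */

    /* Offline step: the DMM trace yields the total energy over the window, and
     * the idle baseline is subtracted so only the incremental cost of inference
     * is attributed to the model:
     *
     *   E_net_per_inference = (E_total - P_idle * t_active) / N_INFERENCES
     *
     * The two constants below stand in for values read from the instrument.   */
    double e_total_mJ = 250.0;  /* placeholder: energy over active window (mJ)  */
    double p_idle_mW  = 12.0;   /* placeholder: idle power baseline (mW)        */
    double t_active_s = (double)t_active_ms / 1000.0;

    double e_net_mJ = (e_total_mJ - p_idle_mW * t_active_s) / (double)N_INFERENCES;
    printf("net energy per inference: %.4f mJ (active window %.3f s)\n",
           e_net_mJ, t_active_s);
    return 0;
}

Averaging over a large batch of inferences within a single marked window, as sketched above, reduces the influence of trigger latency and sampling granularity on the per-inference figure.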
Similar Papers
Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends
Performance
Compares CPU, GPU, and NPU backends for running small language models on edge devices.
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
Machine Learning (CS)
Uses lightweight software kernels and hardware extensions to run sparse deep neural networks efficiently on microcontrollers.
Scaling LLM Test-Time Compute with Mobile NPU on Smartphones
Distributed, Parallel, and Cluster Computing
Uses the smartphone's mobile NPU to scale LLM test-time compute, letting small models approach the quality of larger ones.