Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
By: Bugra Kilictas, Faruk Alpay
Potential Business Impact:
Makes AI run faster on your phone.
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall": the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M-parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200 ms psycholinguistic latency threshold without opaque dependencies.
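The two mechanisms named in the abstract, a zero-copy mmap weight loader and hand-tuned NEON SIMD kernels, can be illustrated with the minimal C sketch below. This is not the paper's implementation: the file path, the raw-float32 weight layout, the 4096-element row size, and the function names are assumptions made for the example; only standard POSIX calls and ARM NEON intrinsics are used.

#include <arm_neon.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a weight file directly into the address space: no fread, no copy.
 * The OS pages weights in on demand, so initialization latency is
 * effectively eliminated, as described for the zero-copy loader. */
static const float *map_weights(const char *path, size_t *out_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* mapping remains valid after close */
    if (base == MAP_FAILED) { perror("mmap"); return NULL; }

    *out_bytes = (size_t)st.st_size;
    return (const float *)base;       /* assumed raw float32 layout */
}

/* Hand-tuned inner product: four lanes per fused multiply-add.
 * Assumes n is a multiple of 4 and rows are cache-line aligned,
 * mirroring the TVL guarantee of full cache-line utilization. */
static float dot_neon(const float *a, const float *b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    return vaddvq_f32(acc);           /* horizontal sum of the 4 lanes */
}

int main(void) {
    size_t bytes = 0;
    /* "weights.bin" is a placeholder path for illustration only. */
    const float *w = map_weights("weights.bin", &bytes);
    if (!w) return 1;

    float x[4096] = {0};              /* dummy activation vector */
    size_t dim = bytes / sizeof(float);
    if (dim > 4096) dim = 4096;
    printf("row0 . x = %f\n", dot_neon(w, x, dim & ~(size_t)3));

    munmap((void *)w, bytes);
    return 0;
}

In practice, the same pattern extends from a single dot product to full matrix-vector kernels: the mapped region is indexed row by row, and alignment of each row to a cache-line boundary is what allows every loaded line to carry useful weight data.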
Similar Papers
The Immutable Tensor Architecture: A Pure Dataflow Approach for Secure, Energy-Efficient AI Inference
Hardware Architecture
Makes phones run smart AI without slow internet.
Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs
Hardware Architecture
Makes AI run faster using less power.
TZ-LLM: Protecting On-Device Large Language Models with Arm TrustZone
Cryptography and Security
Keeps smartphone AI secrets safe from hackers.