Context-Driven Performance Modeling for Causal Inference Operators on Neural Processing Units
By: Neelesh Gupta, Rakshith Jayanth, Dhruv Parikh, and more
Potential Business Impact:
Makes AI understand long texts on small devices.
The proliferation of large language models (LLMs) has driven demand for long-context inference on resource-constrained edge devices. However, deploying these models on Neural Processing Units (NPUs) presents significant challenges due to an architectural mismatch: the quadratic complexity of standard attention mechanisms conflicts with the memory and compute patterns of edge accelerators. This paper presents a comprehensive performance analysis of various causal inference operators on a modern NPU. We benchmark standard quadratic attention against several sub-quadratic alternatives, including structured state-space and linear attention models. Our analysis reveals that while sub-quadratic methods offer superior scalability, they introduce distinct computational bottlenecks on the NPU's specialized execution units. We identify that quadratic attention becomes severely memory-bound at long contexts, suffering from cache inefficiency and pipeline stalls exceeding 95%. In contrast, sub-quadratic models can become compute-bound on programmable vector cores. These findings provide critical insights for the co-design of hardware-aware models and optimization strategies that enable long-context, on-device AI inference.
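To make the scaling contrast concrete, the sketch below compares the two operator families named in the abstract: standard causal attention, whose L x L score matrix drives O(L² · d) compute and memory traffic, and a generic linear-attention recurrence, whose running d x d state reduces the work to O(L · d²) at the cost of per-step elementwise updates. This is a minimal NumPy illustration written for this summary, not the paper's benchmark code; the feature map `phi` (a ReLU + 1 stand-in for kernels such as elu + 1) and the tensor sizes are illustrative assumptions.

```python
# Minimal sketch contrasting quadratic causal attention with a linear-attention
# recurrence. Illustrative only; not the operators benchmarked in the paper.
import numpy as np

L, d = 4096, 64  # illustrative sequence length and head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((L, d)).astype(np.float32) for _ in range(3))

def causal_attention(Q, K, V):
    """Standard attention: materializes an (L, L) score matrix, giving
    O(L^2 * d) compute and memory traffic -- the regime the abstract
    describes as memory-bound at long contexts."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])                 # (L, L)
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)       # block future tokens
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1.0):
    """Linear attention: a running (d, d) state replaces the score matrix,
    giving O(L * d^2) work. The per-token elementwise and outer-product
    updates map to programmable vector cores, where sub-quadratic models
    can instead become compute-bound."""
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((d, d), dtype=np.float32)   # running sum of k^T v
    z = np.zeros(d, dtype=np.float32)        # running sum of k (normalizer)
    out = np.empty_like(V)
    for t in range(L):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out
```

The loop form of `linear_attention` also hints at why such operators shift pressure onto programmable vector units: the dominant work is many small elementwise and outer-product updates rather than a single large matrix multiplication that a systolic tensor engine can stream through.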
Similar Papers
Scaling LLM Test-Time Compute with Mobile NPU on Smartphones
Distributed, Parallel, and Cluster Computing
Makes small AI models run as fast as big ones.
Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends
Performance
Makes AI run faster on small, cheap devices.
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs
Hardware Architecture
Makes AI understand faster on special chips.