LEAP: LLM Inference on Scalable PIM-NoC Architecture with Balanced Dataflow and Fine-Grained Parallelism

Published: September 18, 2025 | arXiv ID: 2509.14781v1

By: Yimin Wang, Yue Jiet Chong, Xuanyao Fong

Potential Business Impact:

Makes large language models run faster and use less power during inference.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language model (LLM) inference has become a prevalent demand in daily life and industry. The large tensor sizes and computational complexity of LLMs pose challenges to memory, compute, and the data bus. This paper proposes a computation/memory/communication co-designed non-von Neumann accelerator, termed LEAP, that aggregates processing-in-memory (PIM) with a computational network-on-chip (NoC). The matrix multiplications in LLMs are assigned to PIM or the NoC based on data dynamicity to maximize data locality. Model partitioning and mapping are optimized by heuristic design space exploration. Dedicated fine-grained parallelism and tiling techniques enable high-throughput dataflow across the distributed resources in PIM and the NoC. The architecture is evaluated on the Llama 1B/8B/13B models and shows a $\sim$2.55$\times$ throughput (tokens/sec) improvement and a $\sim$71.94$\times$ energy-efficiency (tokens/Joule) boost compared to the A100 GPU.
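To make the assignment idea concrete, here is a minimal Python sketch of how matrix multiplications might be routed to PIM or the NoC by operand dynamicity: weight-stationary products (one static, pretrained operand) go to PIM where the weights can stay resident, while activation-by-activation products that change every token go to the computational NoC. The paper does not publish its exact policy; the class, function names, and the rule below are illustrative assumptions, not the authors' algorithm.

```python
# Illustrative sketch (assumed, not the paper's actual algorithm): route each
# matmul to PIM or the computational NoC based on whether an operand is a
# static weight matrix (good data locality in PIM) or both operands are
# dynamic activations (recomputed every token, handled on the NoC).

from dataclasses import dataclass

@dataclass
class MatMul:
    name: str
    lhs_static: bool   # True if the left operand is a pretrained weight matrix
    rhs_static: bool   # True if the right operand is a pretrained weight matrix

def assign_engine(op: MatMul) -> str:
    """Assign a matmul to PIM or the NoC to maximize data locality."""
    # Weight-stationary products suit PIM: the static operand stays in memory.
    if op.lhs_static or op.rhs_static:
        return "PIM"
    # Activation-by-activation products (e.g. Q*K^T, scores*V) are fully dynamic.
    return "NoC"

# Example: the matmuls of one transformer decoder layer (illustrative only).
layer_ops = [
    MatMul("q_proj",   lhs_static=False, rhs_static=True),
    MatMul("k_proj",   lhs_static=False, rhs_static=True),
    MatMul("v_proj",   lhs_static=False, rhs_static=True),
    MatMul("qk_T",     lhs_static=False, rhs_static=False),  # attention scores
    MatMul("score_v",  lhs_static=False, rhs_static=False),
    MatMul("o_proj",   lhs_static=False, rhs_static=True),
    MatMul("ffn_up",   lhs_static=False, rhs_static=True),
    MatMul("ffn_down", lhs_static=False, rhs_static=True),
]

for op in layer_ops:
    print(f"{op.name:>9} -> {assign_engine(op)}")
```

Under this assumed rule, the weight projections land on PIM and the attention score/value products land on the NoC, which matches the abstract's high-level description of splitting work by data dynamicity.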

Country of Origin
🇸🇬 Singapore

Page Count
9 pages

Category
Computer Science:
Hardware Architecture