HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference
By: Cenlin Duan, Jianlei Yang, Rubing Yang, and more
Potential Business Impact:
Makes AI language models generate text much faster.
The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage. Traditional compute-centric accelerators, such as GPUs, suffer from severe resource underutilization and memory bandwidth bottlenecks in these memory-bound workloads. To overcome these fundamental limitations, we propose HPIM, the first memory-centric heterogeneous Processing-In-Memory (PIM) accelerator that integrates SRAM-PIM and HBM-PIM subsystems designed specifically for LLM inference. HPIM employs a software-hardware co-design approach that combines a specialized compiler framework with a heterogeneous hardware architecture. It intelligently partitions workloads based on their characteristics: latency-critical attention operations are mapped to the SRAM-PIM subsystem to exploit its ultra-low latency and high computational flexibility, while weight-intensive GEMV computations are assigned to the HBM-PIM subsystem to leverage its high internal bandwidth and large storage capacity. Furthermore, HPIM introduces a tightly coupled pipeline strategy across the SRAM-PIM and HBM-PIM subsystems to maximize intra-token parallelism, thereby significantly mitigating the serial dependency of the autoregressive decoding stage. Comprehensive evaluations using a cycle-accurate simulator demonstrate that HPIM significantly outperforms state-of-the-art accelerators, achieving a speedup of up to 22.8x compared to the NVIDIA A100 GPU. Moreover, HPIM exhibits superior performance over contemporary PIM-based accelerators, highlighting its potential as a highly practical and scalable solution for accelerating large-scale LLM inference.
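The workload-partitioning rule described above can be sketched as a simple dispatcher: attention operations go to SRAM-PIM, weight-bound GEMV operations go to HBM-PIM. The Python below is a minimal illustration of that mapping only; the enum names, the DecodeOp structure, and the example operation list are assumptions for illustration, not HPIM's actual compiler interface.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Subsystem(Enum):
    SRAM_PIM = auto()   # ultra-low latency, flexible compute (attention on the KV cache)
    HBM_PIM = auto()    # high internal bandwidth, large capacity (weight storage)


class OpKind(Enum):
    ATTENTION = auto()    # latency-critical attention score/context computation
    WEIGHT_GEMV = auto()  # weight-intensive matrix-vector products (projections, FFN)
    OTHER = auto()


@dataclass
class DecodeOp:
    name: str
    kind: OpKind


def map_op(op: DecodeOp) -> Subsystem:
    """Assign a decode-stage operation to a PIM subsystem, following the
    partitioning rule stated in the abstract (illustrative sketch only)."""
    if op.kind is OpKind.ATTENTION:
        return Subsystem.SRAM_PIM
    if op.kind is OpKind.WEIGHT_GEMV:
        return Subsystem.HBM_PIM
    # Fallback for operations the abstract does not classify (assumption).
    return Subsystem.HBM_PIM


if __name__ == "__main__":
    # Hypothetical decode-stage operations for one transformer layer.
    layer_ops = [
        DecodeOp("qkv_projection", OpKind.WEIGHT_GEMV),
        DecodeOp("attention_score_context", OpKind.ATTENTION),
        DecodeOp("output_projection", OpKind.WEIGHT_GEMV),
        DecodeOp("ffn_gemv", OpKind.WEIGHT_GEMV),
    ]
    for op in layer_ops:
        print(f"{op.name:26s} -> {map_op(op).name}")
```

In the full design, the abstract additionally pipelines these two subsystems within a token so that SRAM-PIM attention work overlaps with HBM-PIM GEMV work; the sketch above covers only the static assignment step.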
Similar Papers
HH-PIM: Dynamic Optimization of Power and Performance with Heterogeneous-Hybrid PIM for Edge AI Devices
Hardware Architecture
Saves energy for smart devices doing AI.
PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs
Hardware Architecture
Makes AI chat faster and use less power.