ODMA: On-Demand Memory Allocation Framework for LLM Serving on LPDDR-Class Accelerators
By: Guoqiang Zou, Wanyu Wang, Hao Zheng, and more
Potential Business Impact:
Makes AI models run faster on cheaper chips.
Serving large language models (LLMs) on accelerators with poor random-access bandwidth (e.g., LPDDR5-based devices) is limited by current memory managers. Static pre-allocation wastes memory, while fine-grained paging (e.g., PagedAttention) is ill-suited because of high random-access costs. Existing HBM-centric solutions do not exploit the characteristics of random-access-constrained memory (RACM) accelerators such as the Cambricon MLU370. We present ODMA, an on-demand memory allocation framework for RACM. ODMA handles distribution drift and heavy-tailed request lengths by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard; bucket boundaries are periodically updated from live traces to maximize utilization. On Alpaca and Google-NQ, ODMA improves prediction accuracy over prior work (e.g., from 82.68% to 93.36%). Serving DeepSeek-R1-Distill-Qwen-7B on a Cambricon MLU370-X4, ODMA raises memory utilization from 55.05% to 72.45% and improves requests per second (RPS) and tokens per second (TPS) by 29% and 27% over static baselines. This demonstrates that hardware-aware allocation unlocks efficient LLM serving on RACM platforms.
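The abstract does not give implementation details, but the core idea it names (dynamic bucket partitioning refreshed from live traces, plus a large-bucket safeguard for heavy-tailed requests) can be sketched roughly as below. This is a minimal illustrative sketch, not ODMA's actual algorithm or API; the function names, the quantile-based boundary scheme, and the parameter choices are all assumptions.

# Illustrative sketch only: quantile-based dynamic bucket partitioning with a
# large-bucket safeguard. Names and parameters are assumptions, not ODMA's API.
from bisect import bisect_left

def update_buckets(observed_lengths, num_buckets=8, max_len=4096):
    """Recompute bucket boundaries from a recent trace of request lengths."""
    lengths = sorted(observed_lengths)
    boundaries = []
    for i in range(1, num_buckets):
        # Take roughly evenly spaced quantiles of the observed lengths.
        q = lengths[min(len(lengths) - 1, (i * len(lengths)) // num_buckets)]
        if not boundaries or q > boundaries[-1]:
            boundaries.append(q)
    # Large-bucket safeguard: always keep a bucket sized for heavy-tailed
    # requests so mispredicted long generations still fit.
    if not boundaries or boundaries[-1] < max_len:
        boundaries.append(max_len)
    return boundaries

def choose_bucket(predicted_length, boundaries):
    """Pick the smallest bucket whose capacity covers the predicted length."""
    idx = bisect_left(boundaries, predicted_length)
    idx = min(idx, len(boundaries) - 1)  # overflow falls into the safeguard bucket
    return boundaries[idx]

# Usage: refresh boundaries periodically from a live trace, then allocate one
# contiguous region per request instead of fine-grained pages.
trace = [87, 120, 64, 900, 256, 310, 2048, 75, 133, 512]
bounds = update_buckets(trace, num_buckets=4)
print(bounds, choose_bucket(300, bounds))

The point of the bucketed layout, as the abstract frames it, is that each request gets a single contiguous allocation sized to a predicted length class, avoiding the scattered page accesses that are expensive on LPDDR-class memory.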
Similar Papers
DOLMA: A Data Object Level Memory Disaggregation Framework for HPC Applications
Distributed, Parallel, and Cluster Computing
Makes computers use more memory without slowing down.
Distributed Dynamic Associative Memory via Online Convex Optimization
Machine Learning (CS)
Helps many computers learn together faster.
UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
Distributed, Parallel, and Cluster Computing
Makes AI on phones run much faster.