AxLLM: accelerator architecture for large language models with computation reuse capability
By: Soroush Ahadi, Mehdi Modarressi, Masoud Daneshtalab
Potential Business Impact:
Makes AI models run faster and use less power.
Large language models demand massive computational power and memory resources, posing significant challenges for efficient deployment. While quantization has been widely explored to reduce model size and computation, this paper demonstrates an additional benefit: quantization increases parameter locality, creating opportunities for computation reuse. Building on this insight, we propose AxLLM, a hardware accelerator architecture designed for quantized models. AxLLM introduces a novel redundancy elimination technique that caches and reuses multiplication results for repeated weight values, substantially reducing redundant operations. The architecture features dual multiply and reuse pipelines, efficiently supporting both base models and LoRA fine-tuned models without altering parameters, retraining, or offline preprocessing. Experimental results show that AxLLM achieves up to a 90% reduction in computations, delivering 28% lower energy consumption and a 1.7x speedup over baseline execution. These results highlight AxLLM as a scalable and efficient solution for accelerating LLMs on specialized hardware.
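To illustrate the core idea of computation reuse described in the abstract, here is a minimal software sketch (not the paper's hardware pipeline): because quantization leaves only a small set of distinct weight values, the product of a given activation with each weight level can be computed once and reused for every repeated occurrence. Function and variable names are illustrative assumptions, not from the paper.

```python
import numpy as np

def matvec_with_reuse(weights_q, activations):
    """Sketch of a quantized matrix-vector product with computation reuse.

    weights_q:   2D integer array of quantized weights (few distinct levels,
                 e.g. 4-bit quantization gives at most 16 values).
    activations: 1D float array of input activations.

    For each activation, the multiplication with every distinct weight level
    is performed once and cached; repeated weights reuse the cached product.
    """
    out = np.zeros(weights_q.shape[0])
    levels = np.unique(weights_q)  # distinct quantized weight values
    for j, a in enumerate(activations):
        # Compute a * level once per distinct level (the reuse cache).
        cache = {int(w): a * w for w in levels}
        # Reuse cached products for every repeated weight in this column.
        for i in range(weights_q.shape[0]):
            out[i] += cache[int(weights_q[i, j])]
    return out
```

With 4-bit weights, each activation needs at most 16 real multiplications regardless of the matrix height, which is the redundancy-elimination opportunity AxLLM exploits in hardware.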
Similar Papers
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Hardware Architecture
Makes smart computer programs run faster on small devices.
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
Hardware Architecture
Makes AI run faster and use less power.
ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Machine Learning (CS)
Makes smart AI run on phones, faster and smaller.