AxLLM: accelerator architecture for large language models with computation reuse capability
By: Soroush Ahadi, Mehdi Modarressi, Masoud Daneshtalab
Potential Business Impact:
Makes AI models run faster and use less power.
Large language models demand massive computational power and memory resources, posing significant challenges for efficient deployment. While quantization has been widely explored to reduce model size and computation, this paper demonstrates an additional benefit: quantization increases parameter locality, creating opportunities for computation reuse. Building on this insight, we propose AxLLM, a hardware accelerator architecture designed for quantized models. AxLLM introduces a novel redundancy elimination technique that caches and reuses multiplication results for repeated weight values, substantially reducing redundant operations. The architecture features dual multiply and reuse pipelines, efficiently supporting both base models and LoRA fine-tuned models without altering parameters, retraining, or offline preprocessing. Experimental results show that AxLLM achieves up to a 90% reduction in computations, delivering 28% lower energy consumption and a 1.7x speedup over baseline execution. These results highlight AxLLM as a scalable and efficient solution for accelerating LLMs on specialized hardware.
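To illustrate the core idea of computation reuse described in the abstract, here is a minimal software sketch (not the paper's hardware pipeline): because quantization leaves only a small set of distinct weight values, the product of a given activation with each weight level can be computed once and reused for every repeated occurrence. Function and variable names are illustrative assumptions, not from the paper.

```python
import numpy as np

def matvec_with_reuse(weights_q, activations):
    """Sketch of a quantized matrix-vector product with computation reuse.

    weights_q:   2D integer array of quantized weights (few distinct levels,
                 e.g. 4-bit quantization gives at most 16 values).
    activations: 1D float array of input activations.

    For each activation, the multiplication with every distinct weight level
    is performed once and cached; repeated weights reuse the cached product.
    """
    out = np.zeros(weights_q.shape[0])
    levels = np.unique(weights_q)  # distinct quantized weight values
    for j, a in enumerate(activations):
        # Compute a * level once per distinct level (the reuse cache).
        cache = {int(w): a * w for w in levels}
        # Reuse cached products for every repeated weight in this column.
        for i in range(weights_q.shape[0]):
            out[i] += cache[int(weights_q[i, j])]
    return out
```

With 4-bit weights, each activation needs at most 16 real multiplications regardless of the matrix height, which is the redundancy-elimination opportunity AxLLM exploits in hardware.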
Similar Papers
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Hardware Architecture
Makes smart computer programs run faster on small devices.
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
Hardware Architecture
Makes AI run faster and use less power.
ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Machine Learning (CS)
Makes smart AI run on phones, faster and smaller.