On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration
By: Maoyang Xiang, Ramesh Fernando, Bo Wang
Potential Business Impact:
Makes smart AI run faster on small devices.
Transformer-based Large Language Models (LLMs) have significantly advanced AI capabilities but pose considerable challenges for deployment on edge devices due to high computational demands, memory bandwidth constraints, and energy consumption. This paper addresses these challenges by presenting an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform, a heterogeneous system integrating an ARM Cortex-A53 CPU with reconfigurable FPGA logic. Leveraging Activation-aware Weight Quantization (AWQ) with FPGA-accelerated execution pipelines, the proposed approach improves both model compression and system throughput. Additionally, we propose a hybrid execution strategy that offloads compute-intensive operations to the FPGA while using the CPU for lighter tasks, balancing the computational workload and maximizing overall performance. Our framework achieves a model compression rate of 55.08% relative to the original model and sustains an output rate of 5.1 tokens per second, compared with a baseline of 2.8 tokens per second.
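To make the two key ideas concrete, here is a minimal NumPy sketch of AWQ-style quantization: scale salient input channels by their activation magnitudes so they lose less precision, then apply group-wise symmetric low-bit quantization. This is an illustration of the general AWQ technique, not the authors' implementation; the function name, the 0.5 scaling exponent, the 4-bit width, and the group size of 128 are assumptions chosen for clarity.

```python
import numpy as np

def awq_quantize(W, act_scale, bits=4, group_size=128):
    """Sketch of activation-aware weight quantization (AWQ-style).

    W:          (out_features, in_features) floating-point weight matrix
    act_scale:  per-input-channel activation magnitudes, shape (in_features,)
    """
    # 1. Scale up salient input channels before quantization so channels
    #    with large activations lose less precision (AWQ's core idea).
    #    The 0.5 exponent is a tunable hyperparameter, assumed here.
    s = np.maximum(act_scale, 1e-8) ** 0.5
    W_scaled = W * s                          # broadcast over input channels

    # 2. Group-wise symmetric quantization to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    out_f, in_f = W_scaled.shape
    Wq = np.empty_like(W_scaled)
    group_scales = []
    for g in range(0, in_f, group_size):
        block = W_scaled[:, g:g + group_size]
        scale = np.abs(block).max(axis=1, keepdims=True) / qmax + 1e-12
        q = np.clip(np.round(block / scale), -qmax - 1, qmax)
        Wq[:, g:g + group_size] = q * scale   # dequantized view, for clarity
        group_scales.append(scale)

    # 3. Divide the channel scale back out so Wq approximates W; in a real
    #    deployment s is folded into the preceding op, making it free.
    return Wq / s, group_scales
```

The hybrid CPU/FPGA execution strategy can likewise be sketched as a simple router that sends large, compute-bound operations to the FPGA and keeps small ones on the CPU, where offload and transfer overhead would dominate. The FLOP threshold and the callable handles below are hypothetical; the paper's actual partitioning criteria are not reproduced here, and on the KV260 the FPGA handle would typically be bound through an overlay framework such as PYNQ.

```python
from typing import Callable
import numpy as np

FPGA_FLOP_THRESHOLD = 1e7  # assumed cutoff; tuned empirically in practice

def dispatch_matmul(x: np.ndarray, w: np.ndarray,
                    fpga_matmul: Callable, cpu_matmul: Callable) -> np.ndarray:
    """Route large matmuls to the FPGA, small ones to the CPU."""
    # x: (tokens, in_features), w: (out_features, in_features)
    flops = 2 * x.shape[0] * w.shape[0] * w.shape[1]
    if flops >= FPGA_FLOP_THRESHOLD:
        return fpga_matmul(x, w)   # compute-bound: worth the offload latency
    return cpu_matmul(x, w)        # small op: CPU avoids transfer overhead
```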
Similar Papers
Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency (Computers and Society). Makes smart computer programs run on small devices.
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs (Hardware Architecture). Makes AI run faster and use less power.
Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer (Hardware Architecture). Makes AI models run much faster and cheaper.