ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition
By: Keran Zheng, Yinting Huang, Zhewen Yu, and more
Potential Business Impact:
Makes big AI models run faster on small devices.
Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant accuracy degradation in LLMs when precision falls below 8 bits. This paper addresses this challenge through a software-hardware co-design framework, ITERA-LLM, which integrates sub-8-bit quantization with SVD-based iterative low-rank tensor decomposition for error compensation, leading to higher compression ratios and reduced computational complexity. The proposed approach is complemented by a hardware-aware Design Space Exploration (DSE) process that optimizes accuracy, latency, and resource utilization, tailoring the configuration to the specific requirements of the targeted LLM. Our results show that ITERA-LLM achieves a linear-layer latency reduction of up to 41.1% compared to the quantization-only baseline, while maintaining similar model accuracy.
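To make the error-compensation idea concrete, here is a minimal sketch of one plausible reading of "quantize, then iteratively compensate the quantization error with a truncated SVD". The function names (`quantize`, `itera_compensate`), the bit width, rank, and iteration count are illustrative assumptions, not the paper's actual algorithm or chosen hyperparameters.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric fixed-point quantization (a simplified stand-in
    for the paper's sub-8-bit post-training quantizer)."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def itera_compensate(w, bits=4, rank=8, iters=3):
    """Hypothetical sketch: quantize the weights, then repeatedly take a
    truncated SVD of the remaining quantization error, keeping each
    low-rank factor pair as a cheap side branch."""
    w_q = quantize(w, bits)
    residual = w - w_q
    factors = []
    for _ in range(iters):
        u, s, vt = np.linalg.svd(residual, full_matrices=False)
        # keep only the top-`rank` singular components of the error
        a = u[:, :rank] * s[:rank]
        b = vt[:rank, :]
        factors.append((a, b))
        residual = residual - a @ b
    return w_q, factors

# Usage: the linear layer becomes the quantized matmul plus the
# low-rank corrections, each costing O(rank * (m + n)) per input.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_q, factors = itera_compensate(w, bits=4, rank=8, iters=3)
approx = w_q + sum(a @ b for a, b in factors)
print("relative error:", np.linalg.norm(w - approx) / np.linalg.norm(w))
```

The appeal of such a scheme, as the abstract suggests, is that the dense matmul runs at sub-8-bit precision while the accuracy lost to aggressive quantization is recovered by low-rank branches that add little compute; the hardware-aware DSE would then pick the bit width, rank, and iteration count per layer to trade off accuracy, latency, and resource utilization.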
Similar Papers
InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models
Machine Learning (CS)
Fixes AI math mistakes after shrinking the model.
QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition
Machine Learning (CS)
Makes big computer brains work faster, smarter.
Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models
Machine Learning (CS)
Makes smart computer programs smaller and faster.