MoQE: Improve Quantization Model performance via Mixture of Quantization Experts
By: Jinhao Zhang, Yunquan Zhang, Boyang Zhang, and more
Potential Business Impact:
Helps AI work better on small devices.
Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture that aims to jointly improve the performance of quantized models. MoQE combines multiple quantization variants of one full-precision model as specialized "quantization experts" and dynamically routes each input to the most suitable expert based on its characteristics. Through these specialized quantization experts, MoQE alleviates the performance degradation commonly seen with a single quantized model. We design lightweight, structure-aware router models tailored to both CV and NLP tasks. Experimental evaluations on the ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to state-of-the-art quantized models without incurring a significant increase in inference latency.
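To make the routing idea concrete, here is a minimal sketch of the mechanism the abstract describes: several quantized copies of one backbone act as experts, and a lightweight router picks one expert per input. This is not the authors' code; the class names (Router, MoQE), the mean/std routing features, and the use of PyTorch dynamic quantization for the experts are all assumptions made for illustration.

```python
# Minimal MoQE-style sketch (illustrative only, not the paper's implementation).
import torch
import torch.nn as nn


class Router(nn.Module):
    """Lightweight router that scores which quantization expert to use."""

    def __init__(self, feat_dim: int, num_experts: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_experts),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # One routing score vector per input sample.
        return self.net(feats)


class MoQE(nn.Module):
    """Routes each input to the single most suitable quantized expert."""

    def __init__(self, experts: list, router: Router):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cheap per-sample statistics as routing features; the paper instead
        # designs structure-aware routers for CV and NLP inputs.
        flat = x.flatten(1)
        feats = torch.stack([flat.mean(dim=1), flat.std(dim=1)], dim=1)
        expert_idx = self.router(feats).argmax(dim=-1)

        # Top-1 dispatch: each sample is processed by its chosen expert.
        outs = []
        for i, sample in enumerate(x):
            expert = self.experts[expert_idx[i]]
            outs.append(expert(sample.unsqueeze(0)))
        return torch.cat(outs, dim=0)


if __name__ == "__main__":
    # Hypothetical experts: an INT8 dynamically quantized copy of a tiny model
    # plus the full-precision original as a second "expert".
    base = nn.Linear(32, 10)
    experts = [
        torch.ao.quantization.quantize_dynamic(base, {nn.Linear}, dtype=torch.qint8),
        base,
    ]
    moqe = MoQE(experts, Router(feat_dim=2, num_experts=len(experts)))
    print(moqe(torch.randn(4, 32)).shape)  # torch.Size([4, 10])
```

Because only one expert runs per input, the per-sample compute stays close to that of a single quantized model plus the small router, which is consistent with the abstract's claim of no significant increase in inference latency.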
Similar Papers
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Computer Vision and Pattern Recognition
Makes AI remember more without using much memory.
MoPEQ: Mixture of Mixed Precision Quantized Experts
Machine Learning (CS)
Makes big AI models smaller, faster, and cheaper.
MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance
Machine Learning (CS)
Makes smart computer brains smaller and faster.