SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
By: Yuseon Choi, Sangjin Kim, Jungjun Oh, and more
Potential Business Impact:
Makes smart AI run faster on phones.
MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that keeps low-bit and high-bit slices mutually compatible. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables efficient on-device MoE deployment.
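As a rough illustration of the truncation idea behind bit-sliced expert storage, the sketch below shows how a single 8-bit weight buffer can serve both a high-bit and a low-bit view by keeping only the top bits of each code, so no duplicate low-precision copy is stored. This is a minimal sketch under assumed choices (per-tensor affine quantization, 8-bit base codes, a 4-bit truncated slice); the function names and numeric settings are illustrative and are not taken from the paper's AMAT scheme.

```python
import numpy as np

# Illustrative sketch only (not the paper's implementation): store expert
# weights as 8-bit codes and recover a coarser "slice" by truncating the
# least-significant bits, so low-bit and high-bit views share one buffer.

def quantize_uint8(w: np.ndarray):
    """Affine-quantize float weights to 8-bit codes (per-tensor, assumed setup)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0
    codes = np.round((w - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

def dequantize_slice(codes: np.ndarray, scale: float, zero: float, bits: int):
    """Dequantize using only the top `bits` bits of each stored 8-bit code.

    Truncating to the high bits yields a lower-precision reconstruction
    without keeping a separate low-bit copy of the expert.
    """
    shift = 8 - bits
    truncated = (codes >> shift).astype(np.float32) * (2 ** shift)
    return truncated * scale + zero

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)
    codes, scale, zero = quantize_uint8(w)
    for bits in (8, 4):
        w_hat = dequantize_slice(codes, scale, zero, bits)
        print(f"{bits}-bit slice, mean abs error: {np.abs(w - w_hat).mean():.4f}")
```

The point of the sketch is only that a truncation-compatible code layout lets a cache hold one set of bits per expert while serving requests at different precisions, which is the property the abstract attributes to slice-level caching with AMAT.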
Similar Papers
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Machine Learning (CS)
Makes smart AI run faster on less powerful computers.
SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
Computation and Language
Makes AI smarter and faster by splitting its thinking.
EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Machine Learning (CS)
Makes big AI models use less memory and run faster.