QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design
By: Nilesh Prasad Pandey, Jangseon Park, Onat Gungor, et al.
Potential Business Impact:
Makes AI on phones run faster and use less power.
Deploying Small Language Models (SLMs) on edge platforms is critical for real-time, privacy-sensitive generative AI, yet it is constrained by memory, latency, and energy budgets. Quantization reduces model size and cost but suffers from device noise in emerging non-volatile memories, while conventional memory hierarchies further limit efficiency: SRAM provides fast access but has low density; DRAM must simultaneously accommodate static weights and dynamic KV caches, creating bandwidth contention; and Flash, although dense, is used primarily for initialization and remains inactive during inference. These limitations highlight the need for hybrid memory organizations tailored to LLM inference. We propose Outlier-aware Quantization with Memory Co-design (QMC), a retraining-free quantization method paired with a novel heterogeneous memory architecture. QMC identifies inlier and outlier weights in SLMs, storing inlier weights in compact multi-level Resistive-RAM (ReRAM) while preserving critical outliers in high-precision on-chip Magnetoresistive-RAM (MRAM), mitigating noise-induced degradation. On language modeling and reasoning benchmarks, QMC matches or outperforms state-of-the-art quantization methods that use advanced algorithms and hybrid data formats, while achieving greater compression under both algorithm-only evaluation and realistic deployment settings. Specifically, evaluated against SoTA quantization methods on the latest edge AI platform, QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x relative to FP16, establishing QMC as a scalable, deployment-ready co-design for efficient on-device inference.
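The abstract only sketches how the inlier/outlier split interacts with the two memory tiers. The minimal sketch below shows one way such a partition could look, assuming a magnitude-based outlier criterion (top 1% of weights), uniform 4-bit quantization for the ReRAM-resident inliers, and an additive Gaussian noise model standing in for ReRAM device variation. The threshold, bit width, noise model, and the function name split_and_quantize are illustrative assumptions, not QMC's actual algorithm.

```python
import torch

def split_and_quantize(weight: torch.Tensor,
                       outlier_frac: float = 0.01,
                       inlier_bits: int = 4,
                       reram_noise_std: float = 0.02):
    # Pick the largest-magnitude weights as outliers (illustrative criterion).
    flat = weight.abs().flatten()
    k = max(1, int(outlier_frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    outlier_mask = weight.abs() >= threshold  # outliers stay high precision ("MRAM")

    # Zero out outlier positions so they do not distort the inlier scale.
    inliers = torch.where(outlier_mask, torch.zeros_like(weight), weight)

    # Uniform symmetric quantization of inliers to `inlier_bits` ("ReRAM" levels).
    qmax = 2 ** (inlier_bits - 1) - 1
    scale = inliers.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(inliers / scale), -qmax - 1, qmax)

    # Additive Gaussian noise on the stored levels, a crude stand-in for
    # ReRAM device variation (an assumption, not the paper's noise model).
    q_noisy = q + torch.randn_like(q) * reram_noise_std * qmax
    inliers_deq = q_noisy * scale

    # Reassemble: noisy low-bit inliers plus exact high-precision outliers.
    return torch.where(outlier_mask, weight, inliers_deq), outlier_mask

# Example: apply the split to a single linear-layer weight matrix.
w = torch.randn(4096, 4096)
w_hat, mask = split_and_quantize(w)
print(f"fraction of weights kept at high precision: {mask.float().mean():.4f}")
```

In this sketch only the small outlier fraction would need high-precision on-chip storage, while the bulk of the weights tolerates low-bit, noise-prone storage; the actual QMC criteria and memory mapping are described in the paper.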
Similar Papers
QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Neural and Evolutionary Computing
Makes AI language models smaller and use less power.
SLMQuant: Benchmarking Small Language Model Quantization for Practical Deployment
Machine Learning (CS)
Makes small AI models work on phones.
Sensitivity-Aware Mixed-Precision Quantization for ReRAM-based Computing-in-Memory
Hardware Architecture
Makes computer chips use less power for AI.