Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference
By: Tong Wu, Yutong He, Bin Wang, and more
Potential Business Impact:
Makes AI models use less computer memory.
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks, driven by scaling laws that correlate model size and training data with performance improvements. However, this scaling paradigm incurs substantial memory overhead, creating significant challenges for both training and inference. While existing research has primarily addressed parameter and optimizer state memory reduction, activation memory, particularly from feed-forward networks (FFNs), has become the critical bottleneck, especially when FlashAttention is used. In this work, we conduct a detailed memory profiling of LLMs and identify FFN activations as the predominant source of activation memory overhead. Motivated by this, we introduce Mixture-of-Channels (MoC), a novel FFN architecture that selectively activates only the Top-K most relevant channels per token, as determined by SwiGLU's native gating mechanism. MoC substantially reduces activation memory during pre-training and improves inference efficiency by reducing memory access through partial weight loading into GPU SRAM. Extensive experiments validate that MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
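The core idea, using the SwiGLU gate as a per-channel relevance score and keeping only the Top-K channels per token, can be illustrated with a minimal PyTorch-style sketch. The module name MoCFeedForward, the hyperparameters (hidden_dim, ffn_dim, top_k), and the dense masking strategy below are illustrative assumptions, not the paper's reference implementation; an actual implementation would store only the K selected activations and their indices for the backward pass (and load only the corresponding weight columns at inference) rather than materializing a dense mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCFeedForward(nn.Module):
    """Sketch of a SwiGLU FFN with Top-K channel selection per token.

    Hypothetical module for illustration only; names and the masking
    approach are assumptions, not the authors' code.
    """

    def __init__(self, hidden_dim: int, ffn_dim: int, top_k: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, ffn_dim, bias=False)  # SwiGLU gate branch
        self.up_proj = nn.Linear(hidden_dim, ffn_dim, bias=False)    # SwiGLU value branch
        self.down_proj = nn.Linear(ffn_dim, hidden_dim, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard SwiGLU gating: silu(gate_proj(x)) scores each FFN channel per token.
        gate = F.silu(self.gate_proj(x))                      # (..., ffn_dim)
        # Keep the Top-K channels by gate magnitude and zero out the rest,
        # so only K of ffn_dim activations per token carry information.
        _, topk_idx = gate.abs().topk(self.top_k, dim=-1)
        mask = torch.zeros_like(gate).scatter_(-1, topk_idx, 1.0)
        sparse_hidden = (gate * mask) * self.up_proj(x)
        return self.down_proj(sparse_hidden)

# Example usage with assumed dimensions: keep 512 of 4096 FFN channels per token.
ffn = MoCFeedForward(hidden_dim=1024, ffn_dim=4096, top_k=512)
y = ffn(torch.randn(2, 16, 1024))  # (batch, sequence, hidden)
```

Because the gate already exists in a standard SwiGLU block, this selection reuses information the model computes anyway; the memory and memory-access savings come from exploiting the resulting sparsity rather than from adding a separate routing network.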
Similar Papers
Flash Multi-Head Feed-Forward Network
Machine Learning (CS)
Makes AI smarter and faster using less memory.
MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
Machine Learning (CS)
Makes AI smarter and faster without retraining.
FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training
Distributed, Parallel, and Cluster Computing
Makes AI learn faster and use less power.