UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
By: Zihao Huang, Yu Bao, Qiyang Min, and more
Potential Business Impact:
Makes smart computer programs use less memory.
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very low memory access, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets while requiring significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
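The abstract describes an architectural recipe rather than a concrete algorithm. As a rough illustration only, the sketch below shows a transformer block in which a sparse product-key memory with PEER-style FFN value processing sits alongside a smaller dense FFN, so every block carries a memory branch. The class names (ProductKeyMemory, MemoryTransformerBlock), dimensions, retrieval scheme, and initialization are simplifying assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a transformer block with
# a sparse product-key memory branch using FFN-style values, next to a dense FFN.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProductKeyMemory(nn.Module):
    """Sparse memory: product-key retrieval over n_keys**2 slots, top-k lookup."""

    def __init__(self, d_model: int, n_keys: int = 128, top_k: int = 8):
        super().__init__()
        self.n_keys, self.top_k = n_keys, top_k
        d_half = d_model // 2
        # Two half-size key tables; their Cartesian product indexes n_keys**2 slots.
        self.keys1 = nn.Parameter(torch.randn(n_keys, d_half) / math.sqrt(d_half))
        self.keys2 = nn.Parameter(torch.randn(n_keys, d_half) / math.sqrt(d_half))
        # PEER-style values: each slot stores a tiny FFN (down/up vectors)
        # instead of a single output embedding.
        self.value_down = nn.Embedding(n_keys * n_keys, d_model)
        self.value_up = nn.Embedding(n_keys * n_keys, d_model)
        self.query = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x)                            # (B, T, d_model)
        q1, q2 = q.chunk(2, dim=-1)                  # two half-queries
        s1 = q1 @ self.keys1.t()                     # (B, T, n_keys)
        s2 = q2 @ self.keys2.t()
        # Top-k per half, then combine: candidate scores are sums of the halves.
        v1, i1 = s1.topk(self.top_k, dim=-1)
        v2, i2 = s2.topk(self.top_k, dim=-1)
        scores = v1.unsqueeze(-1) + v2.unsqueeze(-2)              # (B, T, k, k)
        idx = i1.unsqueeze(-1) * self.n_keys + i2.unsqueeze(-2)   # flat slot ids
        scores, idx = scores.flatten(-2), idx.flatten(-2)
        best, pos = scores.topk(self.top_k, dim=-1)
        slots = idx.gather(-1, pos)                  # (B, T, top_k) selected slots
        w = F.softmax(best, dim=-1).unsqueeze(-1)    # retrieval weights
        down = self.value_down(slots)                # (B, T, top_k, d_model)
        up = self.value_up(slots)
        # FFN-style value: scalar activation GeLU(x . down) scales the up vector.
        h = F.gelu((x.unsqueeze(-2) * down).sum(-1, keepdim=True))
        return (w * h * up).sum(dim=-2)              # (B, T, d_model)


class MemoryTransformerBlock(nn.Module):
    """Self-attention + dense FFN + sparse memory branch in every block."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, ffn_mult: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),  # dense FFN kept narrower so
            nn.GELU(),                               # the memory branch carries a
            nn.Linear(ffn_mult * d_model, d_model),  # larger share of block compute
        )
        self.memory = ProductKeyMemory(d_model)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.ln2(x)
        return x + self.ffn(h) + self.memory(h)      # memory branch in every block


if __name__ == "__main__":
    block = MemoryTransformerBlock()
    out = block(torch.randn(2, 32, 256))
    print(out.shape)  # torch.Size([2, 32, 256])
```

The point of the sketch is the memory-access profile: per token, only top_k of the n_keys**2 value slots are read, which is what makes memory layers cheaper to serve than routing full experts as in MoE.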
Similar Papers
VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
CV and Pattern Recognition
Lets computers watch and remember long videos.
Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
Hardware Architecture
Makes smart computer brains faster and use less power.
Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
Machine Learning (CS)
Makes AI models run faster and cheaper.