MIDUS: Memory-Infused Depth Up-Scaling
By: Taero Kim, Hoyoon Byun, Youngjun Choi, and more
Scaling large language models (LLMs) demands approaches that increase capacity without incurring excessive parameter growth or inference cost. Depth Up-Scaling (DUS) has emerged as a promising strategy that duplicates layers and applies Continual Pre-training (CPT), but its reliance on feed-forward networks (FFNs) limits both efficiency and attainable gains. We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces the FFNs in duplicated blocks with a Head-wise Memory Layer (HML). Motivated by the observation that attention heads play distinct roles both across and within layers, MIDUS assigns an independent memory bank to each head, enabling head-wise retrieval that injects information into subsequent layers while preserving the head-wise functional structure. This design combines sparse memory access with head-wise representations and incorporates an efficient per-head value factorization module, relaxing the usual efficiency-performance trade-off. Across our CPT experiments, MIDUS delivers robust performance improvements over strong DUS baselines while maintaining a highly efficient parameter footprint. These findings establish MIDUS, through its head-wise memory design, as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling.
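To make the architecture concrete, below is a minimal PyTorch sketch of what a head-wise memory layer with per-head memory banks, sparse top-k retrieval, and per-head low-rank value factorization could look like. The class name, hyperparameters, and the specific retrieval and factorization scheme are illustrative assumptions, not the paper's exact design.

```python
# A hedged sketch of a head-wise memory layer (HML) in the spirit of MIDUS.
# Assumptions: names, sizes, top-k retrieval, and the low-rank value
# factorization below are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseMemoryLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, mem_slots=1024, rank=32, top_k=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.top_k = top_k
        # One independent memory bank (keys) per attention head.
        self.keys = nn.Parameter(torch.randn(n_heads, mem_slots, self.d_head) * 0.02)
        # Per-head value factorization: values kept in a low-rank form
        # (mem_slots x rank) and expanded with a per-head (rank x d_head) projection.
        self.values_low = nn.Parameter(torch.randn(n_heads, mem_slots, rank) * 0.02)
        self.values_up = nn.Parameter(torch.randn(n_heads, rank, self.d_head) * 0.02)

    def forward(self, x):
        # x: (batch, seq, d_model) -> split into per-head queries.
        b, t, _ = x.shape
        q = x.view(b, t, self.n_heads, self.d_head)              # (b, t, h, d_head)
        # Score each head's query against that head's own memory keys.
        scores = torch.einsum("bthd,hmd->bthm", q, self.keys)    # (b, t, h, mem_slots)
        # Sparse access: keep only the top-k slots per head and position.
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)    # (b, t, h, k)
        weights = F.softmax(top_scores, dim=-1)
        # Gather the selected low-rank values for each head.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, self.values_low.size(-1))
        v_low = self.values_low.unsqueeze(0).unsqueeze(0).expand(b, t, -1, -1, -1)
        v_sel = torch.gather(v_low, 3, idx)                      # (b, t, h, k, rank)
        # Weighted sum over retrieved slots, then expand back to d_head per head.
        mixed = torch.einsum("bthk,bthkr->bthr", weights, v_sel)
        out = torch.einsum("bthr,hrd->bthd", mixed, self.values_up)
        return out.reshape(b, t, -1)                             # (b, t, d_model)

if __name__ == "__main__":
    layer = HeadwiseMemoryLayer()
    y = layer(torch.randn(2, 16, 768))
    print(y.shape)  # torch.Size([2, 16, 768])
```

In this sketch, the low-rank value store is what keeps the added memory cheap: each head stores mem_slots x rank parameters instead of mem_slots x d_head, and only the top-k retrieved slots are expanded per token.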
Similar Papers
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Machine Learning (CS)
Makes smart computer programs use less memory.
DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding
CV and Pattern Recognition
Makes AI learn faster and use less power.
Flash Multi-Head Feed-Forward Network
Machine Learning (CS)
Makes AI smarter and faster using less memory.