
Controlled LLM Training on Spectral Sphere

Published: January 13, 2026 | arXiv ID: 2601.08393v1

By: Tian Xie, Haoming Luo, Haoyu Tang and more

Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\mu$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\mu$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
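To make the idea of constraining both the weights and their updates concrete, here is a minimal NumPy sketch. It is not the SSO algorithm from the paper: it pairs a Muon-style orthogonalized step (via an exact SVD) with a simple rescaling of the weights back onto a sphere of fixed spectral norm, whereas the paper derives the steepest descent direction on that sphere. The helper names (`spectral_norm`, `orthogonalize`, `sphere_step`), the radius $\sqrt{d_\text{out}/d_\text{in}}$, the learning rate, and the random stand-in gradients are all assumptions chosen for illustration.

```python
import numpy as np

def spectral_norm(m):
    # Largest singular value of m.
    return np.linalg.norm(m, ord=2)

def orthogonalize(g):
    # Muon-style direction: set every singular value of g to 1.
    # An exact SVD is used here for clarity; this is an illustrative choice.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def sphere_step(w, g, lr, radius):
    # Hypothetical helper: take a step whose spectral norm is lr * radius,
    # then rescale the result back onto {W : ||W||_2 = radius}.
    # The rescaling is a simplification, not the paper's derivation.
    w_new = w - lr * radius * orthogonalize(g)
    return w_new * (radius / spectral_norm(w_new))

rng = np.random.default_rng(0)
d_out, d_in = 64, 256
radius = np.sqrt(d_out / d_in)  # assumed muP-style target scale

# Start on the sphere, then take a few steps with stand-in gradients.
w = rng.standard_normal((d_out, d_in))
w *= radius / spectral_norm(w)
for _ in range(10):
    g = rng.standard_normal((d_out, d_in))  # placeholder for a real gradient
    w = sphere_step(w, g, lr=0.1, radius=radius)

print(spectral_norm(w), radius)
```

In this sketch the update and the weight are both held to a fixed spectral norm at every step, which is the "fully aligned" property the abstract contrasts with Muon, where only the update is controlled.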

Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training

Machine Learning (CS)

Makes AI learn better and faster.

4 Jan 2026 0

88%

When do spectral gradient updates help in deep learning?

Machine Learning (CS)

Makes AI learn faster by changing how it trains.

3 Dec 2025 1

87%

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

Machine Learning (CS)

Makes AI learn much faster and use less memory.

30 May 2025 1
