SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training

Published: May 30, 2025 | arXiv ID: 2505.24749v1

By: Yehonathan Refael, Guy Smorodinsky, Tom Tirer and more

Potential Business Impact:

Makes AI learn much faster and use less memory.

Business Areas:

Semantic Search, Internet Services

Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware without sacrificing performance. However, these methods primarily emphasize memory savings, often overlooking potential acceleration in convergence due to their reliance on standard isotropic steepest descent techniques, which can perform suboptimally in the highly anisotropic landscapes typical of deep networks, particularly LLMs. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition (SVD) for moment orthogonalization within a dynamically adapted low-dimensional subspace, enabling norm-inducing steepest descent optimization steps. By explicitly aligning optimization steps with the spectral characteristics of the loss landscape, SUMO effectively mitigates approximation errors associated with commonly used methods like Newton-Schulz orthogonalization approximation. We theoretically establish an upper bound on these approximation errors, proving their dependence on the condition numbers of moments, conditions we analytically demonstrate are encountered during LLM training. Furthermore, we both theoretically and empirically illustrate that exact orthogonalization via SVD substantially improves convergence rates while reducing overall complexity. Empirical evaluations confirm that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
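The contrast the abstract draws between exact SVD orthogonalization and the Newton-Schulz approximation, applied to a moment matrix kept in a low-dimensional subspace, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names, the top-r left-singular-vector choice of subspace, the cubic Newton-Schulz variant, and the hyperparameters `lr` and `beta` are all hypothetical and do not come from this page.

```python
import numpy as np


def orthogonalize_svd(m):
    """Exact orthogonalization: keep singular vectors, set singular values to 1."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def orthogonalize_newton_schulz(m, steps=5):
    """Approximate orthogonalization with the cubic Newton-Schulz iteration."""
    x = m / (np.linalg.norm(m) + 1e-12)  # Frobenius scaling keeps singular values in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


def top_r_basis(grad, r):
    """Assumed subspace choice: top-r left singular vectors of the gradient."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :r]


def subspace_step(w, grad, q, moment, lr=0.01, beta=0.9):
    """One illustrative update: project, accumulate the moment, orthogonalize, project back."""
    g_low = q.T @ grad                             # (r, n) gradient inside the subspace
    moment = beta * moment + (1.0 - beta) * g_low  # first-moment estimate (assumed form)
    direction = orthogonalize_svd(moment)          # exact SVD on the small (r, n) matrix
    return w - lr * (q @ direction), moment


def newton_schulz_error(cond, r=4, n=16, steps=5, seed=0):
    """Frobenius gap between Newton-Schulz and exact SVD for a given condition number."""
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((r, r)))
    v, _ = np.linalg.qr(rng.standard_normal((n, r)))
    s = np.geomspace(1.0, 1.0 / cond, r)           # singular values from 1 down to 1/cond
    m = u @ np.diag(s) @ v.T
    return np.linalg.norm(orthogonalize_newton_schulz(m, steps) - orthogonalize_svd(m))
```

In this toy setting, comparing `newton_schulz_error(1.0)` with `newton_schulz_error(1e4)` shows the fixed-step Newton-Schulz result drifting further from the exact SVD answer as the matrix becomes ill-conditioned, which is the qualitative dependence on condition number the abstract describes. Because the SVD is taken on the small projected matrix rather than the full weight gradient, its cost in this sketch scales with the subspace rank.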

COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs

Machine Learning (CS)

Makes AI learn faster and use less memory.

24 Feb 2025 2

88%

Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning

Machine Learning (CS)

Keeps AI smart on new tasks, not forgetting old ones.

9 Apr 2025 1

87%

FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

Machine Learning (CS)

Trains big AI faster and uses less memory.

23 May 2025 1

Country of Origin

🇮🇱 Israel

Page Count

24 pages