SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
By: Dominik Kallusky, Vinay Rao, Vishal Nandavanam, and others
Potential Business Impact:
Speeds up the training of large language models, so the same model quality can be reached with less compute.
The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, and has been reported to outperform even AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5-2.5$\times$ in a non-distributed setting up to a scale of $10^{23}$ training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
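The two-loop scheme described in the abstract can be summarized in a few lines of code. Below is a minimal, illustrative sketch in PyTorch of a Step-$K$ Nesterov outer loop wrapped around an AdamW inner optimizer; the function name `snoo_train` and the hyperparameter values (`K`, `outer_lr`, `outer_momentum`) are assumptions for illustration, not values taken from the paper.

```python
# Minimal sketch of a Step-K Nesterov outer loop around an inner optimizer.
# Assumes PyTorch; hyperparameters below are illustrative, not from the paper.
import itertools
import torch

def snoo_train(model, data_loader, loss_fn, steps=1000, K=32,
               inner_lr=1e-3, outer_lr=0.7, outer_momentum=0.9):
    inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)   # fast weights live in `model`
    slow_params = [p.detach().clone() for p in model.parameters()]   # slow weights
    momentum_buf = [torch.zeros_like(p) for p in slow_params]        # outer Nesterov momentum state

    batches = itertools.cycle(data_loader)
    for step in range(1, steps + 1):
        x, y = next(batches)
        loss = loss_fn(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()                                             # inner AdamW step on fast weights

        if step % K == 0:                                            # outer step every K inner steps
            with torch.no_grad():
                for p, slow, m in zip(model.parameters(), slow_params, momentum_buf):
                    pseudo_grad = slow - p                           # trajectory of the last K inner steps
                    m.mul_(outer_momentum).add_(pseudo_grad)         # momentum on the pseudo-gradient
                    slow.add_(pseudo_grad + outer_momentum * m, alpha=-outer_lr)  # Nesterov-style update
                    p.copy_(slow)                                    # reset fast weights to the slow weights
    return model
```

In this sketch the only extra state beyond the inner optimizer is one copy of the slow weights and one momentum buffer per parameter, which is consistent with the abstract's point that the method adds minimal compute and memory overhead.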
Similar Papers
Enhancing Optimizer Stability: Momentum Adaptation of The NGN Step-size
Machine Learning (CS)
Makes optimizer training more stable and less sensitive to poorly chosen settings.
Better LMO-based Momentum Methods with Second-Order Information
Optimization and Control
Makes training faster and better by adding second-order information to LMO-based momentum methods.
ANO: Faster is Better in Noisy Landscape
Machine Learning (CS)
Helps AI models learn better under noisy training conditions.