Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

Published: August 2, 2025 | arXiv ID: 2508.01483v1

By: Aleksandr Dremov, Alexander Hägele, Atli Kosson, and more

Potential Business Impact:

Improves the efficiency and final quality of transformer training by better configuring the learning rate schedule, particularly its final cooldown phase.

Learning rate scheduling is essential in transformer training, where the final annealing phase plays a crucial role in achieving the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase of the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis shows that different cooldown shapes expose a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming the alternatives. Similarly, we find substantial performance variations, comparable to those from cooldown shape selection, when tuning AdamW hyperparameters; notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. Finally, we provide visualizations of the loss landscape during cooldown, lending empirical support to the river valley perspective on the loss landscape. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
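To make the scheduler concrete, below is a minimal Python sketch of a WSD schedule with a configurable cooldown shape. The function name `wsd_lr`, the phase fractions, and the three shape formulas are illustrative assumptions, not the paper's exact definitions.

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, cooldown_frac=0.2,
           cooldown_shape="linear"):
    """Warmup-Stable-Decay (WSD) learning rate at a given step.

    Three phases: linear warmup to peak_lr, a constant (stable) phase,
    then a final cooldown whose shape is configurable. The paper's central
    point is that the cooldown shape itself matters for final performance.
    """
    warmup_steps = int(warmup_frac * total_steps)
    cooldown_steps = int(cooldown_frac * total_steps)
    stable_end = total_steps - cooldown_steps

    if step < warmup_steps:          # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:            # stable phase: constant peak_lr
        return peak_lr
    # Cooldown phase: anneal from peak_lr toward 0.
    t = (step - stable_end) / max(1, cooldown_steps)  # progress in [0, 1]
    if cooldown_shape == "linear":
        return peak_lr * (1.0 - t)
    if cooldown_shape == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
    if cooldown_shape == "sqrt":     # faster initial drop, slower tail
        return peak_lr * (1.0 - math.sqrt(t))
    raise ValueError(f"unknown cooldown shape: {cooldown_shape}")

if __name__ == "__main__":
    total = 10_000
    for s in (0, 250, 500, 5_000, 8_000, 9_000, 9_999):
        print(s, f"{wsd_lr(s, total, peak_lr=3e-4):.2e}")
```

In an optimizer framework such as PyTorch, the paper's $\beta_2$ observation could plausibly be applied by raising `betas[1]` in the optimizer's parameter groups when training enters the cooldown phase; the exact values to use would need to be taken from the paper's experiments.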

Page Count
29 pages

Category
Computer Science:
Machine Learning (CS)