Cautious Weight Decay
By: Lizhang Chen, Jonathan Li, Kaizhao Liang, and more
Potential Business Impact:
Improves the final loss and accuracy of large machine-learning models through a drop-in optimizer change that requires no extra hyperparameters or tuning.
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
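The masking rule described above can be illustrated with a short sketch. This is a minimal PyTorch-style example, not the authors' reference implementation: it assumes the optimizer exposes its raw update direction `update` (the step taken as param ← param − lr · update, e.g. AdamW's normalized momentum term), and the helper name `cautious_weight_decay_step` is hypothetical.

```python
import torch

def cautious_weight_decay_step(param: torch.Tensor,
                               update: torch.Tensor,
                               lr: float,
                               weight_decay: float) -> None:
    """One parameter step with Cautious Weight Decay (sketch).

    Decoupled weight decay is applied only on coordinates where the sign
    of the parameter agrees with the sign of the optimizer update, per
    the abstract; all other coordinates receive no decay.
    """
    # Mask of coordinates whose signs align with the optimizer update.
    mask = (param * update > 0).to(param.dtype)
    # Standard decoupled step, with the decay term gated by the mask.
    param.add_(update + weight_decay * mask * param, alpha=-lr)
```

Compared with standard decoupled decay (as in AdamW), the only difference is the `mask` factor gating the decay term, which is why the abstract describes CWD as a one-line, optimizer-agnostic modification.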
Similar Papers
Correction of Decoupled Weight Decay
Machine Learning (CS)
Makes computer learning more stable and faster.
AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
Machine Learning (CS)
Makes AI learn faster and use less memory.
Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training
Machine Learning (CS)
Helps computers learn faster and better.