Cautious Weight Decay

Published: October 14, 2025 | arXiv ID: 2510.12402v1

By: Lizhang Chen, Jonathan Li, Kaizhao Liang, and more

Potential Business Impact:

Improves the final accuracy and training efficiency of large machine learning models with a one-line optimizer change and no additional hyperparameter tuning.

Business Areas:
A/B Testing, Data and Analytics

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
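The abstract describes CWD as a masked form of decoupled weight decay: decay is applied only on coordinates where the parameter's sign agrees with the optimizer's update. Below is a minimal sketch of that idea on an AdamW-style step. The function name, the update convention (theta <- theta - lr * update), and the exact sign-agreement mask are assumptions for illustration, not the paper's pseudocode.

```python
# Sketch of Cautious Weight Decay (CWD) on an AdamW-style step.
# Assumption: decay a coordinate only when sign(theta) matches sign(update),
# i.e. when the optimizer step is already shrinking that coordinate.
import numpy as np

def adamw_cwd_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=0.1):
    """One AdamW-style step with Cautious Weight Decay (hypothetical helper)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) # Adam update direction

    # CWD mask: 1 where parameter sign aligns with the update sign, else 0.
    mask = (np.sign(theta) == np.sign(update)).astype(theta.dtype)

    # Standard decoupled decay would use weight_decay * theta everywhere;
    # the "one-line" change is multiplying the decay term by the mask.
    theta = theta - lr * (update + weight_decay * mask * theta)
    return theta, m, v
```

Relative to standard AdamW, the only difference in this sketch is the masked decay term; the same masking could in principle be dropped into Lion- or Muon-style updates, since the mask depends only on the parameter and the update direction.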

Country of Origin
🇺🇸 United States

Page Count
36 pages

Category
Computer Science:
Machine Learning (CS)