Can SGD Handle Heavy-Tailed Noise?
By: Ilyas Fatkhullin, Florian Hübler, Guanghui Lan
Potential Business Impact:
Shows that standard computer training (plain SGD) can still learn reliably from very messy data.
Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise -- common in modern machine learning and reinforcement learning -- remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under Hölder smoothness, we prove convergence to a stationary point with rate $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$, and complement this with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we consider non-convex Mini-batch SGD under standard smoothness and bounded central moment assumptions, and show that it also achieves a comparable $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$ sample complexity with a potential improvement in the smoothness constant. These results challenge the prevailing view that heavy-tailed noise renders SGD ineffective, and establish vanilla SGD as a robust and theoretically principled baseline -- even in regimes where the variance is unbounded.
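The abstract's central claim, that plain SGD with a polynomial step-size schedule tolerates infinite-variance gradient noise, can be illustrated with a small numerical sketch. The snippet below is not code from the paper: the quadratic objective, the Pareto noise model, and every name and constant (d, p, tail_index, eta0, alpha, T, stochastic_grad) are illustrative assumptions, chosen so the noise has a bounded p-th moment for p = 1.5 but unbounded variance.

```python
# Illustrative sketch only (not the authors' code): vanilla SGD, with no clipping
# or normalization, on the strongly convex quadratic f(x) = 0.5 * ||x||^2, where
# stochastic gradients carry heavy-tailed noise with a finite p-th moment for
# some p in (1, 2) but infinite variance. Step sizes follow a polynomial schedule
# eta_t = eta0 / (t + 1)^alpha; all constants are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)

d = 10                  # problem dimension
p = 1.5                 # assumed moment order: E ||noise||^p < infinity
tail_index = 1.8        # Pareto tail in (p, 2]: p-th moment finite, variance infinite
eta0, alpha = 0.5, 1.0  # polynomial step-size schedule eta_t = eta0 / (t + 1)^alpha
T = 50_000              # number of SGD iterations

def stochastic_grad(x):
    """True gradient x of f(x) = 0.5 * ||x||^2 plus symmetric Pareto noise."""
    noise = rng.pareto(tail_index, size=d) * rng.choice([-1.0, 1.0], size=d)
    return x + noise

x = rng.normal(size=d)  # random initialization
for t in range(T):
    eta = eta0 / (t + 1) ** alpha
    x = x - eta * stochastic_grad(x)  # plain (unprojected, unclipped) SGD step

print("final distance to the optimum:", np.linalg.norm(x))
```

Even without clipping or normalization, the final distance is typically small, in line with the abstract's claim that vanilla SGD remains effective when the variance is unbounded; lowering tail_index below p would violate the bounded p-th moment assumption, and the guarantees discussed above would no longer apply.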
Similar Papers
Quantitative Convergence Analysis of Projected Stochastic Gradient Descent for Non-Convex Losses via the Goldstein Subdifferential
Optimization and Control
Makes AI learn faster without needing extra tricks.
Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence
Optimization and Control
Helps computers learn better with messy data.
Accelerated stochastic first-order method for convex optimization under heavy-tailed noise
Optimization and Control
Makes computers learn faster with messy data.