Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
By: Hiroshi Horii, Sothea Has
Potential Business Impact:
Picks better starting values for neural network weights so models train to better results.
Stochastic gradient descent (SGD), one of the most fundamental optimization algorithms in machine learning (ML), can be recast through a continuous-time approximation as a Fokker-Planck equation for Langevin dynamics, a viewpoint that has motivated many theoretical studies. Within this framework, we study the relationship between the quasi-stationary distribution derived from this equation and the initialization distribution through the Kullback-Leibler (KL) divergence. Because the quasi-stationary distribution depends on the expected cost function, the KL divergence ultimately links the expected cost function to the initialization distribution. Applying this to deep neural networks (DNNs), we express a bound on the expected loss function explicitly in terms of the initialization parameters. Minimizing this bound then yields an optimal condition on the initialization variance in the Gaussian case. This result provides a concrete mathematical criterion, rather than a heuristic, for selecting the scale of weight initialization in DNNs. In addition, we confirm our theoretical results experimentally by training fully connected neural networks with classical SGD on the MNIST and Fashion-MNIST datasets. The results show that when the variance of the initialization distribution satisfies our theoretical optimal condition, the corresponding DNN model consistently achieves lower final training loss and higher test accuracy than with the conventional He-normal initialization. Our work thus supplies a mathematically grounded indicator that guides the choice of initialization variance and clarifies its physical meaning in terms of the parameter dynamics of DNNs.
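To make the experimental comparison concrete, the sketch below shows how one might set up the two initializations being contrasted: the conventional He-normal baseline and a zero-mean Gaussian whose variance is chosen explicitly. This is a minimal illustration, not the authors' code; the value passed as `variance` is a placeholder for whatever the paper's optimal condition prescribes (the condition itself is not stated in this abstract), and the PyTorch usage and layer sizes are assumptions made for concreteness.

```python
# Minimal sketch (not the authors' code): compare He-normal initialization with a
# zero-mean Gaussian initialization whose variance is supplied externally.
import torch.nn as nn

def make_mlp(layer_sizes):
    """Fully connected network of the kind trained on MNIST / Fashion-MNIST."""
    layers = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        layers += [nn.Linear(fan_in, fan_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU on the output layer

def init_he_normal(model):
    """Conventional He-normal baseline: Var(W) = 2 / fan_in."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)

def init_gaussian(model, variance):
    """Zero-mean Gaussian initialization with a user-chosen variance."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=variance ** 0.5)
            nn.init.zeros_(m.bias)

# Build two identical MLPs for 28x28 inputs and 10 classes.
sizes = [784, 256, 256, 10]
baseline = make_mlp(sizes); init_he_normal(baseline)
candidate = make_mlp(sizes); init_gaussian(candidate, variance=1e-2)  # placeholder value
```

Training both models with the same plain SGD settings and comparing final training loss and test accuracy mirrors the comparison described in the abstract above.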
Similar Papers
Quantitative Convergence Analysis of Projected Stochastic Gradient Descent for Non-Convex Losses via the Goldstein Subdifferential
Optimization and Control
Makes AI learn faster without needing extra tricks.
Phase diagram and eigenvalue dynamics of stochastic gradient descent in multilayer neural networks
Disordered Systems and Neural Networks
Helps computers learn better by finding the best settings.
Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems
Machine Learning (CS)
Makes pictures look like other pictures.