Deep Progressive Training: scaling up depth capacity of zero/one-layer models
By: Zhiqi Bu
Potential Business Impact:
Trains big computer brains faster, saving energy.
Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To train models efficiently at scale, an effective strategy is progressive training, which scales up model capacity during training and hence significantly reduces computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate training $\approx 5\times$, while achieving almost the same loss as a fully trained 60-layer model with 7B parameters.
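To make the idea of depth expansion concrete, here is a minimal sketch of progressive depth growth for a toy residual network in PyTorch. It is not the paper's GPT2 setup: the names (`GrowableNet`, `grow`), the expansion step, and the choice to zero-initialize each new block's output projection so that a freshly added block starts as the identity map are illustrative assumptions inspired by the "zero" in zero/one-layer initialization.

```python
# Sketch of progressive depth expansion (assumptions noted above), not the paper's code.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, dim: int, zero_init: bool = False):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        if zero_init:
            # Zero the last projection so the new block initially contributes nothing,
            # leaving the loss unchanged at the moment of expansion.
            nn.init.zeros_(self.ff[-1].weight)
            nn.init.zeros_(self.ff[-1].bias)

    def forward(self, x):
        return x + self.ff(x)


class GrowableNet(nn.Module):
    def __init__(self, dim: int, init_depth: int):
        super().__init__()
        self.dim = dim
        self.blocks = nn.ModuleList(ResidualBlock(dim) for _ in range(init_depth))
        self.head = nn.Linear(dim, dim)

    def grow(self, n_new: int):
        # Append zero-initialized blocks; existing weights are untouched.
        for _ in range(n_new):
            self.blocks.append(ResidualBlock(self.dim, zero_init=True))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)


# Usage: start shallow, train, then double the depth at a chosen expansion step.
model = GrowableNet(dim=64, init_depth=4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 64), torch.randn(32, 64)
for step in range(200):
    if step == 100:                  # expansion timing is itself a hyperparameter
        model.grow(n_new=4)
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # rebuild optimizer over new params
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a full training recipe, the learning rate schedule and other hyperparameters would also need to be transferred across the expansion, which the paper studies; the sketch simply recreates the optimizer for brevity.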
Similar Papers
Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency
Machine Learning (CS)
Makes smart computer programs smaller and faster.
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
Computation and Language
Makes AI smarter by growing its brain slowly.
Self-Composing Neural Operators with Depth and Accuracy Scaling via Adaptive Train-and-Unroll Approach
Machine Learning (CS)
Makes computer models solve hard science problems faster.