Frac-Connections: Fractional Extension of Hyper-Connections
By: Defa Zhu, Hongzhi Huang, Jundong Zhou and more
Potential Business Impact:
Makes computer learning faster and more memory-efficient.
Residual connections are central to modern deep learning architectures, enabling the training of very deep networks by mitigating gradient vanishing. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths at different depths, thereby addressing the seesaw effect between gradient vanishing and representation collapse. However, Hyper-Connections increase memory access costs by expanding the width of hidden states. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width. Frac-Connections retain partial benefits of Hyper-Connections while reducing memory consumption. To validate their effectiveness, we conduct large-scale experiments on language tasks, with the largest being a 7B MoE model trained on up to 3T tokens, demonstrating that Frac-Connections significantly outperform residual connections.
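The abstract contrasts two ways of generalizing a residual connection: Hyper-Connections widen the hidden stream to multiple copies, while Frac-Connections split the existing width-d state into fractions and learn how those fractions are mixed. The sketch below illustrates that splitting-and-mixing idea in PyTorch. It is a minimal, illustrative reading of the abstract, not the paper's exact equations: the module name `FracConnection`, the two mixing matrices `res_mix` and `out_mix`, and their identity initialization are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class FracConnection(nn.Module):
    """Illustrative fraction-based connection (assumption-based sketch,
    not the authors' exact formulation).

    The width-d hidden state is split into n_frac fractions of width
    d // n_frac. Learnable matrices mix the input fractions (residual part)
    and the fractions of the wrapped sub-layer's output, then the fractions
    are re-concatenated. The stream therefore stays width d instead of being
    expanded to n copies as in Hyper-Connections.
    """

    def __init__(self, d_model: int, n_frac: int, sublayer: nn.Module):
        super().__init__()
        assert d_model % n_frac == 0, "d_model must be divisible by n_frac"
        self.n_frac = n_frac
        self.sublayer = sublayer  # e.g. an attention or feed-forward block on width d_model
        # Identity initialization so the module starts as a plain residual
        # connection, h + f(h) (an assumption for this sketch).
        self.res_mix = nn.Parameter(torch.eye(n_frac))  # mixes input fractions
        self.out_mix = nn.Parameter(torch.eye(n_frac))  # mixes sub-layer output fractions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model)
        b, s, d = h.shape
        f = d // self.n_frac
        o = self.sublayer(h)                    # (b, s, d)
        h_frac = h.view(b, s, self.n_frac, f)   # input fractions
        o_frac = o.view(b, s, self.n_frac, f)   # output fractions
        # New fraction i = sum_j res_mix[i, j] * h_j + sum_j out_mix[i, j] * o_j
        mixed = torch.einsum("ij,bsjf->bsif", self.res_mix, h_frac) \
              + torch.einsum("ij,bsjf->bsif", self.out_mix, o_frac)
        return mixed.reshape(b, s, d)
```

Under these assumptions, wrapping each attention or MLP sub-layer as `FracConnection(d_model, n_frac, sublayer)` adds only 2·n_frac² scalar parameters per layer, and the residual stream keeps its original width d. This is the memory argument the abstract makes against width-expanding Hyper-Connections.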
Similar Papers
Hierarchical Residuals Exploit Brain-Inspired Compositionality
Machine Learning (CS)
Brain-inspired computer learns faster, sees better.
ResNets Are Deeper Than You Think
Machine Learning (CS)
Makes computer learning better by changing how it learns.
Residual connections provably mitigate oversmoothing in graph neural networks
Machine Learning (CS)
Fixes "too deep" computer learning problems.