Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models
By: Isaac Gerber
Potential Business Impact:
Makes AI learn better with fewer parts.
Decoder-only transformer networks have become incredibly popular for language modeling tasks. State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). In this paper, we examine the importance of the FFN during the model pre-training process through a series of experiments, confirming that the FFN is important to model performance. Furthermore, we show that a transformer block configuration with three-layer FFNs, using fewer such blocks overall, outperforms the standard two-layer configuration, delivering lower training loss with fewer total parameters in less time.
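To make the comparison described in the abstract concrete, here is a minimal PyTorch sketch of a pre-norm decoder block whose FFN depth can be set to two layers (the standard configuration) or three layers (the variant the paper studies). The layer widths, GELU activation, and pre-norm arrangement are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch (assumed hyperparameters, not the paper's exact code) contrasting
# a standard two-layer FFN transformer block with a three-layer FFN variant.
import torch
import torch.nn as nn


def make_ffn(d_model: int, d_hidden: int, num_layers: int) -> nn.Sequential:
    """Build an FFN with `num_layers` linear layers (2 = standard, 3 = variant)."""
    layers = [nn.Linear(d_model, d_hidden), nn.GELU()]
    for _ in range(num_layers - 2):
        layers += [nn.Linear(d_hidden, d_hidden), nn.GELU()]
    layers.append(nn.Linear(d_hidden, d_model))
    return nn.Sequential(*layers)


class DecoderBlock(nn.Module):
    """Pre-norm decoder block: causal multi-head attention followed by an FFN."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_hidden: int = 2048, ffn_layers: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = make_ffn(d_model, d_hidden, ffn_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)               # (batch, sequence, d_model)
    standard = DecoderBlock(ffn_layers=2)      # standard two-layer FFN block
    variant = DecoderBlock(ffn_layers=3)       # three-layer FFN block
    print(standard(x).shape, variant(x).shape) # both: torch.Size([2, 16, 512])
```

Under this sketch, the paper's comparison amounts to stacking fewer `ffn_layers=3` blocks against more `ffn_layers=2` blocks and comparing training loss, parameter count, and training time.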
Similar Papers
Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models
Computation and Language
Makes AI smarter by moving its thinking parts.
Flash Multi-Head Feed-Forward Network
Machine Learning (CS)
Makes AI smarter and faster using less memory.
Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer
Machine Learning (CS)
Lets computers learn by focusing on important words.