Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
By: Xuezhe Ma, Shicheng Wen, Linghao Jin, and more
Potential Business Impact:
Lets computers remember much longer stories.
Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. Design choices in Transformers, including quadratic complexity and weak length extrapolation, limit their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and further introduces multiple technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
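The exponential moving average (EMA) component that Gecko inherits from Mega and Megalodon can be illustrated with a minimal sketch. The damped EMA recurrence below follows the general form used in Mega-style architectures; the function name and parameters here are illustrative assumptions, not Gecko's actual implementation.

```python
import numpy as np

def ema_smooth(x, alpha, delta):
    """Damped exponential moving average over timesteps (Mega-style sketch).

    Recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},
    where alpha is the EMA weight and delta is a damping factor.
    x has shape (seq_len, dim); the recurrence runs over the first axis.
    """
    y = np.zeros_like(x, dtype=float)
    prev = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        # Blend the current input with the damped running state.
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y
```

With `alpha = 1` and `delta = 1` the recurrence reduces to the identity, while smaller `alpha` mixes in an exponentially decaying summary of earlier timesteps, which is what lets the gated-attention blocks operate on inputs that already carry long-range context.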
Similar Papers
Lizard: An Efficient Linearization Framework for Large Language Models
Computation and Language
Lets computers remember more without slowing down.
Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training
Distributed, Parallel, and Cluster Computing
Trains AI faster by fixing computer work jams.
Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models
Genomics
Makes DNA computers understand much longer genetic codes.