Score: 3

Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

Published: January 10, 2026 | arXiv ID: 2601.06463v1

By: Xuezhe Ma, Shicheng Wen, Linghao Jin, and more

BigTech Affiliations: Massachusetts Institute of Technology, Meta

Potential Business Impact:

Lets language models read and recall much longer documents.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. The design choices of the Transformer, including quadratic complexity and weak length extrapolation, limit its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and introduces several additional technical components to improve its capability to capture long-range dependencies: timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and approaching Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4× longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
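
The exponential moving average (EMA) combined with gated attention that the abstract refers to comes from the Mega/Megalodon lineage. As a rough illustration only (this is not the authors' Gecko implementation; the function name, array shapes, and the damped-EMA recurrence below follow the published Mega formulation, and timestep decay normalization, sliding chunk attention, and adaptive working memory are not sketched because the abstract only names them), a per-dimension damped EMA over a sequence can be written as:

```python
import numpy as np

def damped_ema(x: np.ndarray, alpha: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Per-dimension damped exponential moving average (Mega-style).

    Recurrence: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}
    x:     (seq_len, dim) input sequence
    alpha: (dim,) smoothing factors in (0, 1)
    delta: (dim,) damping factors in (0, 1)
    """
    h = np.zeros(x.shape[1], dtype=x.dtype)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = alpha * x[t] + (1.0 - alpha * delta) * h
        out[t] = h
    return out

# Toy usage: smooth a length-8 sequence of 4-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
smoothed = damped_ema(x, alpha=np.full(4, 0.5), delta=np.full(4, 0.9))
print(smoothed.shape)  # (8, 4)
```

In Mega-style blocks, the smoothed sequence typically feeds a gated attention layer; presumably Gecko's sliding chunk attention and adaptive working memory operate at that stage, but the abstract gives no further detail.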

Country of Origin
🇺🇸 United States

Repos / Data Links
https://github.com/XuezheMax/gecko-llm

Page Count
22 pages

Category
Computer Science:
Machine Learning (CS)