Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
By: Xuezhe Ma, Shicheng Wen, Linghao Jin, and more
Potential Business Impact:
Lets computers remember much longer stories.
Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. Design choices in Transformers, including quadratic complexity and weak length extrapolation, limit their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and further introduces multiple technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
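The exponential moving average (EMA) component that Gecko inherits from Mega and Megalodon can be illustrated with a minimal sketch. The damped EMA recurrence below follows the general form used in Mega-style architectures; the function name and parameters here are illustrative assumptions, not Gecko's actual implementation.

```python
import numpy as np

def ema_smooth(x, alpha, delta):
    """Damped exponential moving average over timesteps (Mega-style sketch).

    Recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},
    where alpha is the EMA weight and delta is a damping factor.
    x has shape (seq_len, dim); the recurrence runs over the first axis.
    """
    y = np.zeros_like(x, dtype=float)
    prev = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        # Blend the current input with the damped running state.
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y
```

With `alpha = 1` and `delta = 1` the recurrence reduces to the identity, while smaller `alpha` mixes in an exponentially decaying summary of earlier timesteps, which is what lets the gated-attention blocks operate on inputs that already carry long-range context.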
Similar Papers
Lizard: An Efficient Linearization Framework for Large Language Models
Computation and Language
Lets computers remember more without slowing down.
Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training
Distributed, Parallel, and Cluster Computing
Trains AI faster by fixing computer work jams.
Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models
Genomics
Makes DNA computers understand much longer genetic codes.