ENA: Efficient N-dimensional Attention
By: Yibo Zhong
Potential Business Impact:
Helps computers understand complex, long data faster.
Efficient modeling of long sequences of high-order data requires a more efficient architecture than the Transformer. In this paper, we investigate two key aspects of extending linear recurrent models, especially those originally designed for language modeling, to high-order data (1D to ND): scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, while attention-hybrid models yield promising results. Focusing on the latter, we further evaluate different types of attention and find that tiled high-order sliding window attention (SWA) is efficient in both theory and practice. We term the resulting hybrid architecture of linear recurrence and high-order SWA Efficient N-dimensional Attention (ENA). We then conduct several experiments to demonstrate its effectiveness. The intuition behind ENA is that linear recurrence compresses global information into a state, while SWA complements it by enforcing strict local modeling. Together, they form a simple framework that offers a promising and practical solution for ultra-long high-order data modeling.
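To make the hybrid intuition concrete, below is a minimal PyTorch sketch of a block that pairs a gated linear recurrence (global context compressed into a running state) with local attention over 2D tiles. This is an illustrative assumption, not the paper's implementation: the recurrence is a naive sequential scan, and the "high-order SWA" is approximated here with non-overlapping tiles rather than true overlapping sliding windows; all class and parameter names (SimpleLinearRecurrence, TiledSWA2D, ENABlock2D, tile) are hypothetical.

```python
# Sketch of a linear-recurrence + tiled local-attention hybrid block (2D case).
# Assumptions are noted inline; the actual ENA architecture may differ.
import torch
import torch.nn as nn


class SimpleLinearRecurrence(nn.Module):
    """Compresses global context into a running state via a gated linear scan."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x):            # x: (B, L, D), flattened N-D tokens
        g = torch.sigmoid(self.gate(x))
        v = self.value(x)
        state = torch.zeros_like(v[:, 0])
        outs = []
        for t in range(x.shape[1]):  # sequential scan; real kernels parallelize this
            state = g[:, t] * state + (1 - g[:, t]) * v[:, t]
            outs.append(state)
        return torch.stack(outs, dim=1)


class TiledSWA2D(nn.Module):
    """Strict local modeling: softmax attention restricted to 2D tiles
    (non-overlapping tiles here, as a stand-in for sliding windows)."""
    def __init__(self, dim, heads=4, tile=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tile = tile

    def forward(self, x):            # x: (B, H, W, D)
        B, H, W, D = x.shape
        t = self.tile
        # Regroup into (B * num_tiles, t*t, D) so attention never crosses tile borders.
        x = x.reshape(B, H // t, t, W // t, t, D).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * (H // t) * (W // t), t * t, D)
        out, _ = self.attn(x, x, x)
        out = out.reshape(B, H // t, W // t, t, t, D).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, D)


class ENABlock2D(nn.Module):
    """Hybrid block: global linear recurrence followed by local tiled attention."""
    def __init__(self, dim):
        super().__init__()
        self.recurrence = SimpleLinearRecurrence(dim)
        self.local_attn = TiledSWA2D(dim)

    def forward(self, x):            # x: (B, H, W, D)
        B, H, W, D = x.shape
        x = x + self.recurrence(x.reshape(B, H * W, D)).reshape(B, H, W, D)
        x = x + self.local_attn(x)
        return x


if __name__ == "__main__":
    block = ENABlock2D(dim=32)
    tokens = torch.randn(2, 8, 8, 32)   # a small 2D token grid
    print(block(tokens).shape)           # torch.Size([2, 8, 8, 32])
```

In this sketch the recurrence supplies a compressed summary of everything seen so far, while the tiled attention recovers exact interactions within each local neighborhood, which is the division of labor the abstract describes.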
Similar Papers
Native Hybrid Attention for Efficient Sequence Modeling
Computation and Language
Makes AI understand long stories better and faster.
Fractional neural attention for efficient multiscale sequence processing
Machine Learning (CS)
Makes AI smarter by copying how brains pay attention.
GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
Machine Learning (CS)
Makes AI models learn faster and remember more.