Multiscale Aggregated Hierarchical Attention (MAHA): A Game-Theoretic and Optimization-Driven Approach to Efficient Contextual Modeling in Large Language Models
By: Caner Erden
Potential Business Impact:
Makes AI understand long stories faster and cheaper.
The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.
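To make the multiscale idea concrete, the following is a minimal PyTorch sketch of attention computed at several resolutions and then fused, assuming average pooling as a stand-in for the paper's learnable downsampling operators and learned simplex (softmax) weights as a stand-in for its convex optimization / Nash equilibrium aggregation layer. The class name `MAHASketch`, the scale set `(1, 2, 4)`, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multiscale aggregated attention (illustrative, not the
# paper's implementation). Downsampling is approximated by average pooling;
# the optimization-/game-theoretic aggregation is replaced by learnable
# softmax fusion weights on the probability simplex.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAHASketch(nn.Module):
    """Self-attention at several resolutions, fused with simplex weights."""

    def __init__(self, d_model: int, n_heads: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in scales
        )
        # Learnable fusion logits; softmax keeps the weights non-negative and
        # summing to one, mimicking a resource-allocation constraint.
        self.fusion_logits = nn.Parameter(torch.zeros(len(scales)))

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        outputs = []
        for scale, attn in zip(self.scales, self.attn):
            if scale > 1:
                # Coarsen the sequence by the scale factor (assumed operator).
                xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=scale,
                                  stride=scale).transpose(1, 2)
            else:
                xs = x
            out, _ = attn(xs, xs, xs)           # self-attention at this scale
            if scale > 1:
                # Upsample back to the original resolution before fusion.
                out = F.interpolate(out.transpose(1, 2), size=x.size(1),
                                    mode="linear",
                                    align_corners=False).transpose(1, 2)
            outputs.append(out)
        w = torch.softmax(self.fusion_logits, dim=0)       # (num_scales,)
        return sum(wi * oi for wi, oi in zip(w, outputs))  # fused output


if __name__ == "__main__":
    # Usage: a 4096-token batch, purely to check shapes.
    x = torch.randn(2, 4096, 256)
    maha = MAHASketch(d_model=256, n_heads=8)
    print(maha(x).shape)                        # torch.Size([2, 4096, 256])
```

In this sketch the savings come from running full attention only on pooled sequences; the paper's reported 81% FLOPs reduction at length 4096 depends on its specific hierarchy and aggregation, which this toy fusion does not reproduce.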
Similar Papers
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Computation and Language
Makes AI understand long texts faster and cheaper.
Knocking-Heads Attention
Computation and Language
Lets AI learn better by sharing ideas between parts.
Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
Machine Learning (CS)
Makes AI learn faster with fewer calculations.