Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality
By: Sirui Chen, Jingji Chen, Siqi Zhu, and more
Distributed attention is a fundamental problem for scaling the context window of Large Language Models (LLMs). The state-of-the-art method, Ring-Attention, suffers from scalability limitations due to its excessive communication traffic. This paper proposes a new distributed attention algorithm, Mesh-Attention, by rethinking the design space of distributed attention with a new matrix-based model. Our method assigns a two-dimensional tile -- rather than a one-dimensional row or column -- of computation blocks to each GPU, achieving higher efficiency through a lower communication-to-computation (CommCom) ratio. This general approach covers Ring-Attention as a special case and allows the CommCom ratio to be tuned with different tile shapes. Importantly, we propose a greedy algorithm that efficiently searches the scheduling space within each tile, subject to restrictions that ensure efficient communication among GPUs. Our theoretical analysis shows that Mesh-Attention has much lower communication complexity and exhibits better scalability than other current algorithms. Extensive experiments show that Mesh-Attention achieves up to 3.4x speedup (2.9x on average) and reduces communication volume by up to 85.4% (79.0% on average) on 256 GPUs. Our scalability results further demonstrate that Mesh-Attention sustains superior performance as the system scales, substantially reducing overhead in large-scale deployments. These results convincingly confirm the advantage of Mesh-Attention.
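To make the tiling intuition concrete, the sketch below is a minimal, hypothetical Python illustration (not the paper's actual cost model): if a GPU is assigned an r x c tile of attention blocks, it needs roughly r query-block rows and c key/value-block columns as input while performing r*c block computations, so the communication-to-computation ratio behaves like (r + c) / (r * c). This surface-to-volume argument favors square tiles over the one-dimensional assignment used by Ring-Attention; the function name and block counts here are illustrative assumptions only.

```python
# Illustrative sketch of the CommCom intuition from the abstract.
# Assumption: communication ~ number of Q-block rows plus K/V-block columns
# a GPU must gather; computation ~ number of attention blocks it evaluates.

def comm_com_ratio(rows: int, cols: int) -> float:
    """Approximate communication-to-computation ratio for a rows x cols tile."""
    communication = rows + cols   # Q and K/V blocks gathered by this GPU
    computation = rows * cols     # attention blocks computed locally
    return communication / computation

if __name__ == "__main__":
    # Both assignments cover the same 64 blocks per GPU (hypothetical figure).
    print("1-D tile (1 x 64), Ring-Attention-like:", comm_com_ratio(1, 64))  # ~1.016
    print("2-D tile (8 x 8), Mesh-Attention-like: ", comm_com_ratio(8, 8))   # 0.25
```

Under this simplified model, the square 2-D tile communicates roughly a quarter as much per unit of computation as the 1-D assignment, which mirrors the scaling advantage the abstract claims for Mesh-Attention.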
Similar Papers
ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism
Machine Learning (CS)
Makes AI learn much faster with more computers.
TASP: Topology-aware Sequence Parallelism
Machine Learning (CS)
Makes AI understand long texts faster.
Attention and Compression is all you need for Controllably Efficient Language Models
Machine Learning (CS)
Lets computers remember more with less effort.