Compute Can't Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure
By: Myoungsoo Jung
Potential Business Impact:
Builds faster, more scalable AI systems by connecting memory, processors, and accelerators more efficiently.
Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory capacity, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale because inter-GPU communication overheads grow with cluster size. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects, collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion), and introduce a hybrid CXL-over-XLink design that reduces long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and we evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.
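The abstract's core claim, that the "communication tax" of moving data across fabric tiers dominates at scale, can be made concrete with a toy cost model. The Python sketch below is a minimal illustration only: the link names, bandwidth and latency figures, and the placement_cost_us helper are assumptions chosen for plausibility, not parameters or results from the paper.

# Toy cost model for the "communication tax": time to move a tensor
# across different fabric tiers. All link names and figures below are
# illustrative assumptions for this sketch, not measurements from the paper.

from dataclasses import dataclass

@dataclass
class Link:
    name: str
    bandwidth_gb_s: float  # assumed usable bandwidth, GB/s
    latency_us: float      # assumed one-way latency, microseconds

    def transfer_time_us(self, nbytes: int) -> float:
        # 1 GB/s moves 1e3 bytes per microsecond, so nbytes / (GB/s * 1e3) is in us.
        return self.latency_us + nbytes / (self.bandwidth_gb_s * 1e3)

# Hypothetical tiers of a CXL-over-XLink hierarchy.
XLINK_LOCAL = Link("XLink (intra-pod accelerator fabric)", 400.0, 1.0)
CXL_POOL = Link("CXL pooled memory (cross-pod)", 64.0, 2.5)

def placement_cost_us(nbytes: int, refetches: int, link: Link) -> float:
    # Total time spent re-fetching the same data `refetches` times.
    return refetches * link.transfer_time_us(nbytes)

if __name__ == "__main__":
    kv_shard = 256 * 1024 * 1024  # a 256 MiB KV-cache shard
    for link in (XLINK_LOCAL, CXL_POOL):
        ms = placement_cost_us(kv_shard, 8, link) / 1e3
        print(f"{link.name}: {ms:.1f} ms for 8 re-fetches")

With these assumed numbers, eight re-fetches cost about 5.4 ms over the local accelerator fabric versus roughly 33.6 ms from the pooled tier, which captures the intuition behind the paper's hierarchical memory model: keep hot data in local memory and spill colder data to the CXL pool.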
Similar Papers
AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
Hardware Architecture
Finds the best computer chips for AI tasks.
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Hardware Architecture
Builds faster, cheaper computer centers for giant AI.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI talk faster by fixing how computers share info.