Characterizing Communication Patterns in Distributed Large Language Model Inference
By: Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, and more
Potential Business Impact:
Makes AI respond faster by improving how GPUs share data.
Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment of these models, inter-GPU communication creates significant performance constraints that limit service quality in real-world systems. This paper investigates communication dynamics in distributed LLM serving, analyzing how various parallelization approaches coordinate data exchange between GPU workers during inference. We study dense transformer-based models as representative examples of contemporary architectures widely used in operational deployments. Our work combines detailed profiling measurements with predictive analytical models to characterize communication behavior across different parallelization configurations. Results show that tensor parallelism incurs substantial network overhead but delivers superior response times for short sequences; pipeline parallelism minimizes data transfer requirements while increasing total latency; and combined approaches demand careful tuning to achieve balanced performance. These insights offer practical recommendations for selecting appropriate parallelization schemes in production LLM services and identify key opportunities for optimizing inference frameworks and communication infrastructure.
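To make the tensor- vs. pipeline-parallel trade-off concrete, below is a minimal back-of-the-envelope sketch of per-token communication volume, not the paper's actual analytical model. It assumes Megatron-style tensor parallelism (two all-reduces per transformer layer), a ring all-reduce that moves 2*(tp-1)/tp times the message size per GPU, fp16 activations, and pipeline parallelism that forwards one activation vector across each stage boundary. All function names and model shapes are illustrative.

```python
# Back-of-the-envelope communication-volume model for distributed LLM inference
# during autoregressive decoding (one token at a time). Assumptions, not from
# the paper: Megatron-style TP with 2 all-reduces per layer, ring all-reduce,
# fp16 (2-byte) activations, PP sending activations once per stage boundary.

def tp_bytes_per_token(hidden: int, layers: int, tp: int, dtype_bytes: int = 2) -> float:
    """Per-GPU bytes moved per generated token under tensor parallelism.

    Each layer runs 2 all-reduces over a (1, hidden) activation; a ring
    all-reduce moves 2 * (tp - 1) / tp * message_size bytes per GPU.
    """
    msg = hidden * dtype_bytes
    per_allreduce = 2 * (tp - 1) / tp * msg
    return layers * 2 * per_allreduce


def pp_bytes_per_token(hidden: int, pp: int, dtype_bytes: int = 2) -> float:
    """Total bytes moved per generated token under pipeline parallelism.

    Only the (1, hidden) activation crosses each of the pp - 1 stage boundaries.
    """
    return (pp - 1) * hidden * dtype_bytes


if __name__ == "__main__":
    # Illustrative 13B-class shapes: hidden=5120, 40 layers, 4-way parallelism.
    print(f"TP=4: {tp_bytes_per_token(5120, 40, 4) / 1e6:.2f} MB/token per GPU")
    print(f"PP=4: {pp_bytes_per_token(5120, 4) / 1e6:.3f} MB/token total")
```

Even this crude model reproduces the qualitative finding above: tensor parallelism communicates on the order of a megabyte per token (latency-sensitive, frequent collectives), while pipeline parallelism moves only kilobytes point-to-point but serializes stages, lengthening end-to-end latency.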
Similar Papers
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Distributed, Parallel, and Cluster Computing
Makes AI models train faster on many computers.
Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM
Distributed, Parallel, and Cluster Computing
Predicts computer learning time without needing supercomputers.