Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs
By: Changhai Man, Joongun Park, Hanjiang Wu, and more
Potential Business Impact:
Builds simulated runs of large AI training and inference jobs so systems can be tested and tuned before they are built.
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE will be publicly available to facilitate further research in distributed machine learning systems.
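To make the idea of synthesizing tensor-level traces concrete, here is a minimal sketch of how per-GPU compute, memory, and communication could be derived symbolically for one transformer MLP block under tensor/pipeline/data parallelism. All names, formulas, and the node format below are illustrative assumptions; they are not STAGE's actual API or cost model.

```python
# Illustrative sketch only: a toy "symbolic" trace generator in the spirit of
# the abstract's goal (tensor-level compute/memory/communication per GPU under
# tensor/pipeline/data parallelism). Formulas and node layout are assumptions.

from dataclasses import dataclass

@dataclass
class ParallelConfig:
    tp: int = 8    # tensor-parallel degree
    pp: int = 4    # pipeline-parallel degree
    dp: int = 16   # data-parallel degree

    @property
    def num_gpus(self) -> int:
        return self.tp * self.pp * self.dp

def mlp_layer_trace(hidden: int, seq: int, micro_batch: int,
                    cfg: ParallelConfig, bytes_per_elem: int = 2):
    """Return symbolic per-GPU nodes for one decoder layer's MLP block,
    using standard rough formulas (Megatron-style sharding assumed)."""
    # Two GEMMs: (seq*mb, hidden) x (hidden, 4*hidden) and its reverse,
    # with weights sharded across the tp GPUs of one tensor-parallel group.
    gemm_flops = 2 * 2 * (seq * micro_batch) * hidden * (4 * hidden) // cfg.tp
    weight_bytes = 2 * hidden * (4 * hidden) * bytes_per_elem // cfg.tp
    # The row-parallel GEMM ends with an all-reduce over output activations.
    allreduce_bytes = seq * micro_batch * hidden * bytes_per_elem
    return [
        {"op": "GEMM_mlp", "flops": gemm_flops, "mem_bytes": weight_bytes},
        {"op": "AllReduce_tp", "comm_bytes": allreduce_bytes,
         "group": f"tp{cfg.tp}"},
    ]

if __name__ == "__main__":
    cfg = ParallelConfig(tp=8, pp=4, dp=16)   # 512 GPUs in this toy config
    nodes = mlp_layer_trace(hidden=8192, seq=4096, micro_batch=1, cfg=cfg)
    print(f"Symbolic nodes for a {cfg.num_gpus}-GPU configuration:")
    for n in nodes:
        print(n)
```

Because the quantities are computed from the model and parallelism parameters rather than measured on hardware, the same sketch scales to arbitrary GPU counts by changing the configuration, which is the appeal of symbolic trace synthesis described in the abstract.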
Similar Papers
STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
Distributed, Parallel, and Cluster Computing
Creates simulated runs of AI workloads for co-designing distributed systems.
Large Language Models as Realistic Microservice Trace Generators
Software Engineering
Uses AI to generate realistic usage traces so software can be tested against real-world load.
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Speeds up AI inference by studying how machines share data with each other.