A Systematic Characterization of LLM Inference on GPUs
By: Haonan Wang, Xuxin Xiao, Mingyu Yan, and more
Potential Business Impact:
Could make AI systems respond faster and run more efficiently on existing hardware.
This work presents a systematic characterization of Large Language Model (LLM) inference to address the field's fragmented understanding of its performance behavior. Through comprehensive experiments, we establish a four-dimensional analytical framework: (1) Two-Phase Heterogeneity Observation; (2) Microarchitectural Root Cause Analysis; (3) System Scaling Principles; and (4) Emerging Paradigm Boundaries. Our investigation progresses systematically from observation to foresight: identifying performance phenomena, revealing their hardware causes, validating system-level behavior, and exploring new paradigms. This study not only consolidates a reliable empirical foundation for existing research but also provides new findings and practical optimization guidance for LLM inference.
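The two-phase heterogeneity named in the framework refers to the well-known split between prefill (all prompt tokens processed at once, compute-bound) and decode (one token per step, memory-bound). A minimal roofline-style sketch makes the gap concrete; the hidden size and prompt length below are illustrative assumptions, not the paper's measured configuration.

```python
# Hedged sketch (not the paper's code): estimate arithmetic intensity
# (FLOPs per byte moved) of a single fp16 weight matmul in prefill vs
# decode, illustrating why the two phases stress a GPU differently.

def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs/byte for an (m x k) @ (k x n) matmul, counting input/output traffic."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

d = 4096  # hidden size -- an assumed, illustrative model dimension

# Prefill: a 2048-token prompt is processed as one large batched matmul.
prefill = matmul_intensity(2048, d, d)
# Decode: each step multiplies a single token vector by the same weights.
decode = matmul_intensity(1, d, d)

print(f"prefill intensity ~ {prefill:.0f} FLOPs/byte (compute-bound)")
print(f"decode  intensity ~ {decode:.1f} FLOPs/byte (memory-bound)")
```

With these assumed sizes, prefill lands near 1000 FLOPs/byte while decode sits near 1, which is why decode throughput is typically limited by memory bandwidth rather than peak compute.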
Similar Papers
Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
Performance
Makes AI models run faster and use less power.
Statistical Modeling and Uncertainty Estimation of LLM Inference Systems
Performance
Models how fast and how predictably AI inference systems respond.
Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
Performance
Predicts AI speed on any device before use.