PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations
By: Muhammad Usman Tariq , Abhinav Jangda , Angelica Moreira and more
Advancements in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle when operating on low-level backend code. This challenge is exacerbated in the domain of GPU kernels, where performance-critical details are coupled to rapidly evolving hardware characteristics and available code examples are sparse. In this work, we introduce PEAK, a Performance Engineering AI-Assistant for GPU Kernels powered by natural language transformations. PEAK utilizes the key insight that iterative code transformations (optimizations) can straightforwardly be written in natural language, and then carried out by LLMs. Thus, these transformations can be rapidly developed, encoding general portable optimizations, but also easily specialized to specific GPU devices and even kernels. These natural transformations are supported by a modular and extensible infrastructure that additionally performs validation and performance evaluation. We demonstrate the flexibility of PEAK by instantiating it for three backends, CUDA, HIP, and HLSL, and create 16 natural transformations for optimizing matrix multiplication kernels. We show that our resulting implementations are competitive with vendor libraries when available, and for HLSL (without a library) our implementations match the hardware documented FLOPS. PEAK allows the fine-grained exploration of several research questions around how LLMs behave in this domain, including characterizing transformations and their errors; and how performance evolves along optimization sequences. PEAK provides an interface that can either be utilized by performance engineers to improve productivity, or driven completely autonomously (e.g., by an AI agent), providing a forward-compatible design that can continue to improve with advances in AI capabilities.
Similar Papers
STARK: Strategic Team of Agents for Refining Kernels
Artificial Intelligence
AI makes computer graphics run much faster.
HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
Distributed, Parallel, and Cluster Computing
AI helps computers run same code on different parts.
Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs
Distributed, Parallel, and Cluster Computing
Makes AI chat faster on less powerful computers.