LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
By: Erik Schultheis, Dan Alistarh
Potential Business Impact:
Trains big AI models on normal computers.
We present LLMQ, an end-to-end CUDA/C++ implementation for training medium-sized language models (3B to 32B parameters) on affordable, commodity GPUs. These devices are characterized by low memory capacity and slow communication compared to datacenter-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine-based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems running on far more expensive cloud-grade GPUs.
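To illustrate the kind of offloading the abstract refers to, the following is a minimal, self-contained CUDA/C++ sketch of activation offloading via the GPU's copy engine. It is not LLMQ's actual code; all names and sizes are assumptions for illustration. The idea: activations produced during the forward pass are copied to pinned host memory on a dedicated stream, so the transfer overlaps with ongoing compute, and are prefetched back before the backward pass needs them.

```cpp
// Sketch of copy-engine-based activation offloading (hypothetical, not LLMQ code).
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t n = 1 << 24;              // assumed size of one layer's activations
    const size_t bytes = n * sizeof(float);

    float *d_act = nullptr, *h_act = nullptr;
    CHECK(cudaMalloc(&d_act, bytes));
    CHECK(cudaMallocHost(&h_act, bytes));  // pinned host memory enables async DMA

    cudaStream_t compute, copy;
    CHECK(cudaStreamCreate(&compute));
    CHECK(cudaStreamCreate(&copy));        // transfers run on the copy engine

    cudaEvent_t produced;
    CHECK(cudaEventCreate(&produced));

    // ... a forward kernel writing d_act would be launched on `compute` here ...
    CHECK(cudaEventRecord(produced, compute));

    // Offload: wait until the activation is produced, then copy it to the host
    // without blocking subsequent compute kernels on the compute stream.
    CHECK(cudaStreamWaitEvent(copy, produced, 0));
    CHECK(cudaMemcpyAsync(h_act, d_act, bytes, cudaMemcpyDeviceToHost, copy));

    // Later, before the backward pass needs it, prefetch it back to the device.
    CHECK(cudaMemcpyAsync(d_act, h_act, bytes, cudaMemcpyHostToDevice, copy));
    CHECK(cudaStreamSynchronize(copy));

    CHECK(cudaFree(d_act));
    CHECK(cudaFreeHost(h_act));
    CHECK(cudaStreamDestroy(compute));
    CHECK(cudaStreamDestroy(copy));
    CHECK(cudaEventDestroy(produced));
    printf("offload round-trip done\n");
    return 0;
}
```

The design point being illustrated: because the copy runs on its own stream, the GPU's dedicated copy engine handles the transfer while the compute engine keeps executing later layers, which is what makes offloading affordable on bandwidth-constrained consumer hardware.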
Similar Papers
Performance Trade-offs of Optimizing Small Language Models for E-Commerce
Artificial Intelligence
Makes small computers understand online shoppers better.
Optimizing LLMs Using Quantization for Mobile Execution
Machine Learning (CS)
Makes big AI models fit on your phone.
ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Machine Learning (CS)
Makes smart AI run on phones, faster and smaller.