Recipes for Pre-training LLMs with MXFP8
By: Asit Mishra, Dusan Stosic, Simon Layton, and more
Potential Business Impact:
Enables faster, more memory-efficient LLM pre-training on GPUs without sacrificing accuracy.
Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats, introduced in the NVIDIA Blackwell generation of GPUs, represent a major advancement of this technique, making it practical to combine narrow floating-point data types with finer-granularity per-block scaling factors. In turn, this enables both quantization of more tensors than previous approaches and more efficient execution of operations on those tensors. Effective use of MX formats requires careful choices of various parameters. In this paper we review these choices and show how the MXFP8-E4M3 data type and a specific number conversion algorithm result in training sessions that match those carried out in BF16. We present results using models with up to 8B parameters, trained on high-quality datasets of up to 15T tokens.
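The key mechanism behind MX formats is per-block scaling: in the OCP MX specification, each block of 32 values shares a single power-of-two (E8M0) scale factor, and how that scale is computed during conversion is one of the choices the paper examines. The sketch below is a minimal illustration in NumPy, assuming a 32-element block size and a round-up choice for the scale exponent; the function name `quantize_mxfp8_e4m3` and the exact rounding behavior are illustrative assumptions, not the paper's definitive algorithm, and the bit-exact rounding of values to the E4M3 grid is omitted.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite magnitude representable in FP8 E4M3
BLOCK = 32         # MX formats share one scale per 32-element block

def quantize_mxfp8_e4m3(x):
    """Block-wise quantization of a 1-D array into the MXFP8-E4M3 range.

    Each 32-element block gets one power-of-two (E8M0) scale. The scale
    exponent here is rounded *up* so that amax / scale fits within E4M3;
    this mirrors the kind of conversion choice the paper studies, but the
    paper's exact algorithm should be taken from the paper itself.
    """
    x = np.asarray(x, dtype=np.float32)
    pad = (-x.size) % BLOCK
    xb = np.pad(x, (0, pad)).reshape(-1, BLOCK)

    amax = np.abs(xb).max(axis=1, keepdims=True)
    safe_amax = np.where(amax > 0.0, amax, E4M3_MAX)   # avoid log2(0)
    exp = np.ceil(np.log2(safe_amax / E4M3_MAX))       # E8M0 scale exponent
    scale = np.exp2(exp)                               # power-of-two scale

    # Scale into the representable range and clamp. A real kernel would also
    # round each value to the nearest E4M3 code; that bit-exact step is
    # omitted to keep the block structure easy to follow.
    q = np.clip(xb / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

# Usage: quantize, then dequantize by multiplying each block by its scale.
vals = np.random.randn(1024).astype(np.float32) * 100.0
q, scale = quantize_mxfp8_e4m3(vals)
approx = (q * scale).ravel()[: vals.size]
```

The point of the sketch is the granularity: one scale per 32 values, rather than one per tensor, which is what lets narrow FP8 data types cover the dynamic range needed to track BF16 training.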