Relative Scaling Laws for LLMs
By: William Held, David Hall, Percy Liang, and more
Potential Business Impact:
Shows how AI gets better, but not equally.
Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets, which yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$ to $10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviors split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work so that practitioners can measure relative scaling laws alongside traditional ones and better prioritize robustness challenges in light of the bitter lesson.
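To make the idea concrete, here is a minimal sketch (not the authors' released code) of how a relative scaling law can be estimated: fit a standard saturating power law to each test distribution separately, then track the predicted gap between them as compute grows. The functional form, compute budgets, and loss values below are illustrative assumptions, not the paper's data.

```python
# Minimal sketch of a relative scaling law, assuming a saturating
# power law L(C) = a * C**(-b) + c for each test distribution.
# All numbers below are hypothetical placeholders.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    """Saturating power law: loss decays as x**(-b) toward a floor c."""
    return a * x ** (-b) + c

# Hypothetical IsoFLOP sweep: compute budgets and measured loss on two
# test distributions (e.g. a majority and a minority subpopulation).
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
x = compute / 1e18  # rescale so the optimizer is well conditioned
loss_majority = np.array([3.10, 2.85, 2.62, 2.45, 2.31])
loss_minority = np.array([3.60, 3.28, 2.98, 2.74, 2.54])

# Fit each distribution's absolute scaling law independently.
popt_maj, _ = curve_fit(power_law, x, loss_majority, p0=[1.0, 0.3, 2.0], maxfev=10000)
popt_min, _ = curve_fit(power_law, x, loss_minority, p0=[1.0, 0.3, 2.0], maxfev=10000)

# The relative scaling law: how the gap between the two distributions
# is predicted to evolve with scale, including beyond measured budgets.
for c in [1e20, 1e21, 1e22]:
    xc = c / 1e18
    gap = power_law(xc, *popt_min) - power_law(xc, *popt_maj)
    print(f"compute={c:.0e} FLOPs  predicted gap={gap:.3f}")
```

The point of the exercise: both absolute curves can improve monotonically while the gap between them widens, narrows, or holds steady, and it is this gap trajectory, rather than either absolute curve, that a relative scaling law isolates.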
Similar Papers
Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets
Machine Learning (CS)
Helps pick the best AI for learning from pictures.
Relative-Based Scaling Law for Neural Language Models
Machine Learning (CS)
Helps AI better understand word order.
Scaling Laws for Code: A More Data-Hungry Regime
Computation and Language
Makes computer code smarter with more data.