Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
By: Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, and more
Potential Business Impact:
Makes AI learn much faster with extra clues.
Incorporating metadata into Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal: URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other kinds of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with a masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
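To make the two conditioning schemes in the abstract concrete, here is a minimal sketch, not the authors' implementation: metadata prepending with the loss masked over the metadata tokens (so the model conditions on metadata without being trained to predict it), and metadata appending with the loss kept (so predicting the metadata becomes an auxiliary task). The metadata format, the "quality" field, and the GPT-2 tokenizer are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of metadata prepending vs. appending.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
IGNORE = -100  # labels with this value are excluded from the LM cross-entropy loss


def encode(text):
    return tokenizer(text, add_special_tokens=False)["input_ids"]


def meta_string(metadata):
    # Hypothetical serialization, e.g. {"quality": "high"} -> "<quality:high>"
    return " ".join(f"<{k}:{v}>" for k, v in metadata.items())


def prepend_metadata(doc_text, metadata):
    """Prepend metadata and mask its loss: the model conditions on the
    metadata tokens but is only trained to predict the document tokens."""
    meta_ids = encode(meta_string(metadata))
    doc_ids = encode(doc_text)
    return {
        "input_ids": meta_ids + doc_ids,
        "labels": [IGNORE] * len(meta_ids) + doc_ids,
    }


def append_metadata(doc_text, metadata):
    """Append metadata and keep its loss: predicting the metadata from the
    document acts as an auxiliary next-token objective."""
    doc_ids = encode(doc_text)
    meta_ids = encode(meta_string(metadata))
    return {
        "input_ids": doc_ids + meta_ids,
        "labels": doc_ids + meta_ids,
    }


# Usage: both return input_ids/labels ready for a causal LM trainer.
example = prepend_metadata("Transformers process tokens in parallel.", {"quality": "high"})
```

The masking convention (-100 labels) follows the standard causal-LM training setup; the same idea extends to learnable meta-tokens by reserving trainable embedding slots in place of the serialized metadata string.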
Similar Papers
LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
Computation and Language
Teaches computers language faster and better.
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
Computation and Language
Adds helpful hints to AI for better understanding.
Flexible metadata harvesting for ecology using large language models
Digital Libraries
Finds and links science data for new discoveries.