SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
By: Xiangyu Xi, Deyang Kong, Jian Yang and more
Potential Business Impact:
Makes AI smarter by picking the best training words.
Existing pre-training data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to a suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x fewer training steps to achieve the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.
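The bottom-up idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear blend of scores, the `alpha` parameter, the function name `sample_wise_mix`, and the record fields `quality` and `diversity` are all assumptions made for illustration. Each sample receives its own sampling weight, sampling is done globally across domains, and the domain distribution is read off the result rather than fixed in advance.

```python
import random
from collections import Counter


def sample_wise_mix(samples, budget, alpha=0.5, seed=0):
    """Draw `budget` samples globally, weighting each sample individually.

    samples: list of dicts with keys 'domain', 'quality', 'diversity'
             (scores assumed to lie in [0, 1]; hypothetical schema).
    alpha:   assumed trade-off between diversity and quality.
    """
    rng = random.Random(seed)
    # Assumed scoring rule: a linear blend of the two per-sample scores.
    # At least one weight must be positive, or random.choices raises ValueError.
    weights = [
        alpha * s["diversity"] + (1 - alpha) * s["quality"] for s in samples
    ]
    # Global cross-domain sampling (with replacement, for simplicity).
    picked = rng.choices(samples, weights=weights, k=budget)
    # The domain distribution emerges from the sampled data (bottom-up).
    counts = Counter(s["domain"] for s in picked)
    domain_dist = {d: c / budget for d, c in counts.items()}
    return picked, domain_dist
```

In this sketch a domain whose samples all score zero simply drops out of the mixture, whereas a domain-wise method would still allocate it whatever weight was chosen up front.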
Similar Papers
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Computation and Language
Makes AI smarter by picking the best training words.
A Cross-Domain Few-Shot Learning Method Based on Domain Knowledge Mapping
CV and Pattern Recognition
Teaches computers to learn new things faster.
Sampling and Loss Weights in Multi-Domain Training
Machine Learning (CS)
Helps computers learn better from different data.