Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

Published: December 30, 2025 | arXiv ID: 2512.24265v1

By: Ziqing Fan, Yuqiao Xian, Yan Sun, and more

A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is selecting samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Because of the high computational cost of applying them to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples; both effects limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK treats the selection process as a mask learning problem that involves iteratively sampling data masks, computing policy gradients from predefined objectives evaluated on the sampled masks, and updating the mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning at the scale of trillions of tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
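
The abstract's loop (sample masks, score them, compute a policy gradient, update the sampling logits) can be illustrated with a toy REINFORCE sketch. This is not the DATAMASK implementation. The function names, the form of the objective (mean quality minus mean pairwise similarity minus a budget penalty), the weights `lam` and `mu`, and the independent Bernoulli mask distribution are all assumptions made for illustration, and none of the paper's acceleration techniques are included.

```python
import math
import random


def sigmoid(x):
    # Clamp to avoid overflow in math.exp for extreme logits.
    x = max(-50.0, min(50.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def objective(mask, quality, sim, budget, lam=1.0, mu=1.0):
    """Toy joint score for one mask (assumed form, not the paper's objective)."""
    idx = [i for i, m in enumerate(mask) if m]
    # Quality term: mean quality score of the selected samples.
    mean_quality = sum(quality[i] for i in idx) / len(idx) if idx else 0.0
    # Diversity term: mean pairwise similarity among selected samples (lower is better).
    pairs = [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:]]
    redundancy = sum(sim[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    # Penalize masks whose size deviates from the target budget.
    budget_gap = abs(len(idx) - budget) / len(mask)
    return mean_quality - lam * redundancy - mu * budget_gap


def learn_mask_logits(quality, sim, budget, steps=100, n_samples=16, lr=0.5, seed=0):
    """REINFORCE over independent Bernoulli masks with a mean-reward baseline."""
    rng = random.Random(seed)
    n = len(quality)
    logits = [0.0] * n
    for _ in range(steps):
        probs = [sigmoid(l) for l in logits]
        # 1) Sample a batch of binary masks from the current logits.
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(n_samples)]
        # 2) Score each sampled mask with the joint objective.
        rewards = [objective(m, quality, sim, budget) for m in masks]
        baseline = sum(rewards) / n_samples
        # 3) Policy gradient: d log p(mask) / d logit_i = mask_i - prob_i.
        for i in range(n):
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for m, r in zip(masks, rewards)) / n_samples
            logits[i] += lr * grad
    return logits


def select_top_k(logits, k):
    """Return the (sorted) indices of the k largest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])
```

In this toy objective, a pair of high-quality but near-duplicate samples can score lower than a pair mixing one high-quality sample with a dissimilar one, which is the kind of quality-diversity trade-off the abstract says single-metric selection misses. Calling `learn_mask_logits(quality, sim, budget)` and then `select_top_k(logits, budget)` yields a subset; results depend on the seed and hyperparameters.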

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Machine Learning (CS)

Removes bad knowledge from AI without hurting good knowledge.

5 Dec 2025 1

88%

Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Computation and Language

Makes computer translations much better and faster.

6 Nov 2025 0

88%

QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Computation and Language

Makes AI smarter by picking the best training words.

23 Apr 2025 1
