Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
By: Yudong Wang, Zixuan Fu, Jie Cai, and more
Potential Business Impact:
Makes AI smarter by cleaning its learning data.
With the rapid development of large language models (LLMs), data quality has become a key factor in enhancing model performance, and model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training at minimal computational cost. To tackle the second challenge, we build on the assumption that high-quality seed data is beneficial for LLM training; by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to filter high-quality data efficiently, we employ a lightweight classifier based on fastText and successfully apply the filtering pipeline to two widely used pre-training corpora, FineWeb and Chinese FineWeb, resulting in the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
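The abstract names fastText as the backbone of the quality classifier. As a rough illustration of how such a filter could work in practice, the sketch below trains a supervised fastText model on labeled seed documents and uses it to score new text. The label names, hyperparameters, training-file path, and 0.5 acceptance threshold are illustrative assumptions, not the authors' reported configuration.

# Minimal sketch of a fastText-based quality filter (assumptions noted below).
import fasttext

# Assumed training-file format: one document per line, prefixed with
# __label__hq (high-quality seed data) or __label__lq (negative samples).
# "seed_data.train" is a hypothetical path; lr/epoch/wordNgrams/dim are
# generic fastText settings, not the paper's actual hyperparameters.
model = fasttext.train_supervised(
    input="seed_data.train",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=100,
)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the document as high quality."""
    # fastText's predict() expects a single line, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] >= threshold

In the paper's pipeline, documents kept by a classifier like this form Ultra-FineWeb, and the proposed verification strategy provides the fast feedback used to check whether the classifier's sample selection actually benefits LLM training.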
Similar Papers
Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Computation and Language
Quickly finds harmful content in AI training data.
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Computation and Language
Makes German-language AI models smarter.
Augmented Relevance Datasets with Fine-Tuned Small LLMs
Information Retrieval
Helps computers learn which search results are best.