Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
By: Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy
Potential Business Impact:
Finds bad stuff in AI training data fast.
Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80\% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance--most searches in milliseconds, all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
Similar Papers
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Computation and Language
Makes AI smarter by cleaning its learning data.
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Computation and Language
Cleans up computer brains to stop bad ideas.
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Computation and Language
Makes German computer language models smarter.