
FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

Published: December 10, 2025 | arXiv ID: 2512.09701v1

By: Binbin XU

Potential Business Impact:

Helps computers understand all languages better.

Business Areas:

Text Analytics, Data and Analytics, Software

We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq
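To make the idea of per-character statistics with Unicode metadata concrete, here is a minimal sketch of how such a frequency table could be built for a small piece of text. It is not the FineFreq pipeline: the function names `char_frequencies` and `filter_by_category` and the row fields are hypothetical, chosen only for illustration, and the actual column names in the released CSV and Parquet files may differ. Only the Unicode general category is attached here, because Python's standard `unicodedata` module does not expose script or block.

```python
from collections import Counter
import unicodedata


def char_frequencies(text):
    """Count characters in `text` and attach basic Unicode metadata.

    Hypothetical helper for illustration; the row layout is an assumption,
    not the schema of the released dataset.
    """
    counts = Counter(text)
    total = sum(counts.values())
    rows = []
    # most_common() yields characters in descending order of count.
    for ch, n in counts.most_common():
        rows.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            # Two-letter general category, e.g. "Ll" (lowercase letter).
            "category": unicodedata.category(ch),
            "count": n,
            "rel_freq": n / total,
        })
    return rows


def filter_by_category(rows, prefix):
    """Keep rows whose general category starts with `prefix` (e.g. "L", "N")."""
    return [r for r in rows if r["category"].startswith(prefix)]


if __name__ == "__main__":
    sample = "Hello, world! 123"
    for row in filter_by_category(char_frequencies(sample), "L"):
        print(row["char"], row["codepoint"], row["category"], row["count"])
```

Filtering on the category prefix is one way the metadata could support downstream selection, for example keeping only letters ("L") or only digits ("N") before comparing distributions across languages or years.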

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Computation and Language

Finds bad stuff in AI training data fast.

29 Aug 2025 1

84%

SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets

Computation and Language

Helps understand feelings about COVID-19 from tweets.

9 Oct 2025 2

83%

Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Computation and Language

Makes AI smarter by cleaning its learning data.

8 May 2025 2


Repos / Data Links

github.com

Page Count

8 pages
