An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Published: March 13, 2025 | arXiv ID: 2503.10267v3

By: Laurie Burchell, Ona de Gibert, Nikolay Arefyev and more

Potential Business Impact:

Helps computers learn many languages better.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Computation and Language

Makes computers understand many languages better.

2 Nov 2025

93%

HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Computation and Language

Teaches computers many languages for better understanding.

2 Nov 2025

92%

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Computation and Language

Helps computers translate whole documents in many languages.

18 Aug 2025

Page Count

35 pages
