MegaMath: Pushing the Limits of Open Math Corpora
By: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, and more
Potential Business Impact:
Teaches computers to solve math problems better.
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: we re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet. (2) Recalling math-related code data: we identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: we synthesized QA-style text, math-related code, and interleaved text-code blocks from web or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, offering the largest quantity and top quality among existing open math pre-training datasets.
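To make the filtering step in practice (1) concrete, here is a minimal Python sketch of a fastText-based document filter of the kind the abstract describes. The model path `math_classifier.bin`, the label `__label__math`, and the 0.8 threshold are all assumptions for illustration; the paper's actual classifier and cutoffs are not given in the abstract.

```python
import fasttext

# Hypothetical classifier artifacts -- not the authors' actual model.
MODEL_PATH = "math_classifier.bin"   # assumed: a trained fastText classifier
MATH_LABEL = "__label__math"         # assumed positive label name
THRESHOLD = 0.8                      # assumed precision/recall trade-off

model = fasttext.load_model(MODEL_PATH)

def is_math_document(text: str, threshold: float = THRESHOLD) -> bool:
    """Keep a web-extracted document if the classifier scores it as math."""
    # fastText's predict() rejects newlines, so flatten the document first.
    flat = text.replace("\n", " ")
    labels, probs = model.predict(flat, k=1)
    return labels[0] == MATH_LABEL and probs[0] >= threshold

# Example: filter a batch of extracted Common Crawl documents.
docs = [
    "Let x satisfy x^2 - 5x + 6 = 0. Then x = 2 or x = 3.",
    "Top ten travel destinations for the summer.",
]
math_docs = [d for d in docs if is_math_document(d)]
```

In a real pipeline this classification pass would be followed by deduplication across the retained documents, as the abstract notes, before the data enters pre-training.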
Similar Papers
Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
Computation and Language
Teaches computers to solve math problems better.
TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving
Artificial Intelligence
Helps AI solve math problems for phone networks.
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
Computation and Language
Helps computers solve math problems in many languages.