Tahakom LLM guidelines and recipes: from pre-training data to an Arabic LLM
By: Areej AlOtaibi, Lina Alyahya, Raghad Alshabanah, and more
Potential Business Impact:
Helps computers understand and speak Arabic better.
Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to collecting and filtering Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
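The abstract does not spell out the filtering rules, so the following is only a minimal sketch of a common heuristic for Arabic corpus curation: keep documents that are long enough and predominantly written in Arabic script. The thresholds, Unicode ranges, and the `keep_document` helper are illustrative assumptions, not the paper's published pipeline.

```python
import re

# Arabic Unicode blocks: basic (U+0600-U+06FF), supplement (U+0750-U+077F),
# extended-A (U+08A0-U+08FF), and presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF).
ARABIC_CHAR = re.compile(
    r"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]"
)

def keep_document(text: str, min_chars: int = 200, min_arabic_ratio: float = 0.6) -> bool:
    """Keep a crawled document only if it is long enough and mostly Arabic.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    stripped = "".join(text.split())  # ignore whitespace when counting characters
    if len(stripped) < min_chars:
        return False  # too short to be useful for pre-training
    arabic = len(ARABIC_CHAR.findall(stripped))
    return arabic / len(stripped) >= min_arabic_ratio

# Usage: reduce a raw web crawl to a candidate Arabic pre-training corpus.
raw_docs = ["..."]  # documents from a crawl
corpus = [doc for doc in raw_docs if keep_document(doc)]
```

Real pipelines typically layer further steps on top of this, such as deduplication and quality scoring, but a script-ratio filter of this kind is a standard first pass for language-specific corpora.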
Similar Papers
Large Language Models and Arabic Content: A Review
Computation and Language
Helps computers understand and use Arabic language better.
Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps
Computation and Language
Helps computers understand Arabic better.