Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Published: February 28, 2025 | arXiv ID: 2503.00151v2

By: Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, and more

Potential Business Impact:

Teaches computers to understand Arabic culture and dialects.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, Palm offers a broad, inclusive perspective. We use Palm to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.

Country of Origin
🇨🇦 Canada

Page Count
24 pages

Category
Computer Science: Computation and Language