Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps
By: Ahmed Alzubaidi, Shaikha Alsuwaidi, Basma El Amel Boussaha, and more
Potential Business Impact:
Surveys the benchmarks used to test how well AI programs understand Arabic.
This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches to benchmark construction: native collection, translation, and synthetic generation, discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.
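To make the abstract's mention of evaluation metrics concrete, the sketch below shows how a multiple-choice Arabic benchmark is commonly scored with exact-match accuracy, broken down by taxonomy category. The tiny in-line dataset, the `predict_choice` stub, and all field names are illustrative assumptions for this sketch, not an interface defined by the survey or by any specific benchmark.

```python
from collections import Counter

# Minimal sketch: exact-match accuracy on a hypothetical Arabic multiple-choice
# benchmark. The sample items and predict_choice() stub are placeholders, not
# the actual format or API of any surveyed benchmark.

SAMPLE_ITEMS = [
    {
        "question": "ما هي عاصمة المملكة العربية السعودية؟",  # "What is the capital of Saudi Arabia?"
        "choices": {"A": "جدة", "B": "الرياض", "C": "مكة", "D": "الدمام"},
        "answer": "B",
        "category": "Knowledge",   # one of the survey's four taxonomy categories
    },
    {
        "question": "اختر الجملة الصحيحة نحويًا.",  # "Choose the grammatically correct sentence."
        "choices": {"A": "ذهبَ الطلابُ إلى المدرسةِ", "B": "ذهبوا الطلاب إلى مدرسة"},
        "answer": "A",
        "category": "NLP Tasks",
    },
]


def predict_choice(question: str, choices: dict[str, str]) -> str:
    """Stand-in for a model call; a real harness would prompt an LLM here."""
    return "A"  # trivial baseline: always pick the first option


def evaluate(items: list[dict]) -> dict[str, float]:
    """Return overall accuracy plus per-category accuracy."""
    correct, per_cat_correct, per_cat_total = 0, Counter(), Counter()
    for item in items:
        prediction = predict_choice(item["question"], item["choices"])
        hit = prediction == item["answer"]
        correct += hit
        per_cat_total[item["category"]] += 1
        per_cat_correct[item["category"]] += hit
    report = {"overall": correct / len(items)}
    report.update({c: per_cat_correct[c] / per_cat_total[c] for c in per_cat_total})
    return report


if __name__ == "__main__":
    print(evaluate(SAMPLE_ITEMS))
```

Per-category reporting mirrors the survey's taxonomy-based analysis: an overall score alone can hide weaknesses in, say, Culture and Dialects items.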
Similar Papers
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Computation and Language
Tests whether a model's Arabic language skills are genuine.
Arabic Prompts with English Tools: A Benchmark
Artificial Intelligence
Tests AI's ability to use tools in Arabic.