Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish

Published: August 22, 2025 | arXiv ID: 2508.16431v1

By: Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin and more

Potential Business Impact:

Tests how well computer programs understand Turkish.

Business Areas:

Language Learning Education

We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack task diversity, culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of discriminative and generative tasks, ensuring content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly effective at differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.

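Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları (TR-MMLU Benchmark for Large Language Models: Performance Evaluation, Challenges, and Opportunities for Improvement)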

Computation and Language

Tests how well computers understand the Turkish language.

18 Aug 2025 1

88%

MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

Computation and Language

Helps computers understand Persian language and culture better.

1 Aug 2025 1

88%

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Computation and Language

Helps computers understand Arabic better.

15 Oct 2025 2

Country of Origin

🇹🇷 Turkey

Repos / Data Links

github.com

Page Count

31 pages
