Confounding Factors in Relating Model Performance to Morphology
By: Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux
Potential Business Impact:
Helps computers model many different languages more reliably.
The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence stems from confounding factors in experimental setups, which make it hard to compare results and draw conclusions. We identify these confounding factors in analyses that try to answer whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than modeling fusional languages: morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each of their conclusions is affected by confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline what is needed to reliably answer whether, and how, morphology relates to language modeling.
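The abstract does not spell out which token bigram metric is used, so the following is a minimal sketch of one plausible instantiation: conditional bigram entropy over a tokenized corpus, computable from token IDs alone with no morphological annotation. The function name and the choice of entropy as the statistic are illustrative assumptions, not the authors' definition.

```python
from collections import Counter
from math import log2

def bigram_conditional_entropy(token_ids: list[int]) -> float:
    """Estimate H(t_i | t_{i-1}) in bits from a token ID sequence.

    Lower values mean more predictable token-to-token transitions,
    which (under this sketch's assumption) would correlate with
    easier causal language modeling.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a bigram")
    unigrams = Counter(token_ids[:-1])              # counts of context tokens
    bigrams = Counter(zip(token_ids, token_ids[1:]))  # counts of adjacent pairs
    total = len(token_ids) - 1                      # number of bigrams
    entropy = 0.0
    for (prev, _), n in bigrams.items():
        p_joint = n / total          # p(t_{i-1}, t_i)
        p_cond = n / unigrams[prev]  # p(t_i | t_{i-1})
        entropy -= p_joint * log2(p_cond)
    return entropy

# A perfectly repetitive sequence has zero conditional entropy;
# less predictable sequences score higher.
print(bigram_conditional_entropy([1, 2, 1, 2, 1, 2, 1, 2]))  # 0.0 bits
print(bigram_conditional_entropy([1, 2, 3, 4, 2, 1, 4, 3]))  # ~0.86 bits
```

Under this reading, a language whose tokenization yields flatter next-token distributions (higher conditional entropy) would be predicted to be harder for a causal language model, without requiring any expert morphological labels.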
Similar Papers
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Computation and Language
Helps computers understand languages better by breaking words into pieces.
Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
Computation and Language
Computers learn new made-up words like people do, but how much data they saw matters most.