Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
By: Arne Rubehn, Christoph Rzymski, Luca Ciucci and more
Potential Business Impact:
Helps computers understand number words in any language.
Numeral systems across the world's languages vary in fascinating ways, both in their synchronic structure and in the diachronic processes that shaped how they evolved into their current form. For a proper comparison of numeral systems across languages, however, it is important to code them in a standardized form that allows basic properties to be compared. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow for computer-assisted coding of numeral systems, and provide sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation and find that allomorphy is the major source of segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
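To make the contrast between underlying and surface structure concrete, here is a minimal Python sketch. It is not the coding scheme or workflow from the paper: the data layout, the glosses, the `ANNOTATED` table, and the `allomorphs` helper are all hypothetical, and the English segmentations are chosen only for illustration. The idea is that each numeral is stored as a sequence of surface morphemes paired with underlying glosses, so that glosses realized by more than one surface form (allomorphy) can be listed.

```python
from collections import defaultdict

# Hypothetical annotation: numeral value -> list of (surface form, underlying gloss).
# Illustrative English data only; not the paper's sample or its actual format.
ANNOTATED = {
    2: [("two", "TWO")],
    3: [("three", "THREE")],
    10: [("ten", "TEN")],
    13: [("thir", "THREE"), ("teen", "TEN")],
    20: [("twen", "TWO"), ("ty", "TEN")],
    30: [("thir", "THREE"), ("ty", "TEN")],
}


def allomorphs(annotated):
    """Return the glosses that surface in more than one form, with those forms."""
    forms = defaultdict(set)
    for segments in annotated.values():
        for surface, gloss in segments:
            forms[gloss].add(surface)
    # Keep only glosses realized by more than one distinct surface form.
    return {gloss: sorted(f) for gloss, f in forms.items() if len(f) > 1}


if __name__ == "__main__":
    for gloss, surface_forms in allomorphs(ANNOTATED).items():
        print(gloss, surface_forms)
```

In this toy data, a segmenter that relies only on recurring surface strings would have no direct evidence that "twen" and "two" belong to the same morpheme, which is the kind of difficulty the abstract attributes to allomorphy.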
Similar Papers
Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
Computation and Language
Computers learn math from different number words.
Recursive numeral systems are highly regular and easy to process
Computation and Language
Makes number words easier to learn and use.