BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models
By: Yuan Gao, Suchir Salhan, Andrew Caines and more
Potential Business Impact:
Tests whether AI language models treat errors the way human second-language learners do.
To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.
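The selective tolerance comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Triplet` class, the `selective_tolerance_accuracy` function, the example sentences, and the toy scores are all hypothetical, and the sketch assumes a model exposes some sentence-level plausibility score (for example, a summed log-probability) where higher means more plausible.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Triplet:
    """One benchmark item: (corrected, learner, artificial). Field names are assumed."""
    corrected: str   # corrected version of the sentence
    learner: str     # sentence containing the naturalistic learner error
    artificial: str  # same sentence with a matched, artificial error


def selective_tolerance_accuracy(
    triplets: Iterable[Triplet],
    score: Callable[[str], float],
) -> float:
    """Fraction of triplets where the learner error is scored as more
    plausible than the artificial error (higher score = more plausible)."""
    items = list(triplets)
    if not items:
        raise ValueError("at least one triplet is required")
    hits = sum(score(t.learner) > score(t.artificial) for t in items)
    return hits / len(items)


# Invented sentences and scores, for illustration only (not taken from BLiSS).
EXAMPLES = [
    Triplet(
        corrected="She goes to school every day.",
        learner="She go to school every day.",
        artificial="She goes school to every day.",
    ),
    Triplet(
        corrected="I have lived here for two years.",
        learner="I live here since two years.",
        artificial="I have here lived for two years.",
    ),
]

# Stand-in for a real model: a lookup table of made-up plausibility scores.
TOY_SCORES = {
    "She goes to school every day.": -10.0,
    "She go to school every day.": -12.0,
    "She goes school to every day.": -15.0,
    "I have lived here for two years.": -11.0,
    "I live here since two years.": -16.0,
    "I have here lived for two years.": -14.0,
}

if __name__ == "__main__":
    accuracy = selective_tolerance_accuracy(EXAMPLES, TOY_SCORES.__getitem__)
    print(f"selective tolerance accuracy: {accuracy:.2f}")
```

In practice the `score` callable would wrap a language model rather than a lookup table. Because the metric only compares the two erroneous sentences within a triplet, it can vary independently of whether the model prefers the corrected sentence over either error.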
Similar Papers
How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm
Computation and Language
Helps computers understand many languages better.
Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages
Computation and Language
Finds and fixes unfairness in AI language.
MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
Computation and Language
Tests computers on understanding many languages.