BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Published: October 11, 2025 | arXiv ID: 2510.10159v1

By: Jaap Jumelet, Abdellah Fourtassi, Akari Haga, and more

Potential Business Impact:

Provides training data that lets language models learn from the kind of input children receive while acquiring language, across 45 languages.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

Country of Origin
🇮🇱 🇳🇱 Israel, Netherlands

Repos / Data Links

Page Count
33 pages

Category
Computer Science:
Computation and Language