Beyond Shallow Heuristics: Leveraging Human Intuition for Curriculum Learning
By: Vanessa Toborek, Sebastian Müller, Tim Selbach, and more
Potential Business Impact:
Teaches computers by showing them easy words first.
Curriculum learning (CL) aims to improve training by presenting data from "easy" to "hard", yet defining and measuring linguistic difficulty remains an open challenge. We investigate whether human-curated simple language can serve as an effective signal for CL. Using the article-level labels from the Simple Wikipedia corpus, we compare label-based curricula to competence-based strategies relying on shallow heuristics. Our experiments with a BERT-tiny model show that adding simple data alone yields no clear benefit. However, structuring it via a curriculum -- especially when introduced first -- consistently improves perplexity, particularly on simple language. In contrast, competence-based curricula lead to no consistent gains over random ordering, probably because they fail to effectively separate the two classes. Our results suggest that human intuition about linguistic difficulty can guide CL for language model pre-training.
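To make the two ordering strategies concrete, here is a minimal sketch of a simple-first, label-based curriculum next to a shallow-heuristic (competence-style) baseline. The toy corpus, the `is_simple` field, and the sentence-length proxy are illustrative assumptions for exposition, not the authors' implementation.

```python
import random

# Toy corpus: each article carries a human-curated "simple" label
# (as in Simple Wikipedia) plus its raw text. Fields are illustrative.
corpus = [
    {"text": "The cat sat on the mat.", "is_simple": True},
    {"text": "Quantum chromodynamics describes the strong interaction.", "is_simple": False},
    {"text": "Dogs are friendly animals.", "is_simple": True},
    {"text": "The eigenvalues of a Hermitian operator are real.", "is_simple": False},
]

def label_based_curriculum(articles, seed=0):
    """Simple-first ordering: all human-labelled simple articles first,
    then the remaining articles, shuffled within each block."""
    rng = random.Random(seed)
    simple = [a for a in articles if a["is_simple"]]
    regular = [a for a in articles if not a["is_simple"]]
    rng.shuffle(simple)
    rng.shuffle(regular)
    return simple + regular

def heuristic_curriculum(articles):
    """Competence-style baseline: sort by a shallow difficulty proxy
    (here, average tokens per sentence), easy to hard."""
    def avg_sentence_length(article):
        sentences = [s for s in article["text"].split(".") if s.strip()]
        return sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    return sorted(articles, key=avg_sentence_length)

if __name__ == "__main__":
    for a in label_based_curriculum(corpus):
        print("label-based:", a["text"])
    for a in heuristic_curriculum(corpus):
        print("heuristic  :", a["text"])
```

The ordered list produced by either function would then feed the pre-training data loader in place of random shuffling; the paper's finding is that the label-based, simple-first ordering helps, while the heuristic ordering does not reliably beat random order.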
Similar Papers
Influence-driven Curriculum Learning for Pre-training on Limited Data
Computation and Language
Teaches computers to learn faster by sorting lessons.
What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning
Machine Learning (CS)
Teaches computers math better by sorting problems.
Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Computation and Language
Teaches computers to learn faster and better.