Parameterized Synthetic Text Generation with SimpleStories
By: Lennart Finke, Chandan Sreedhara, Thomas Dooms, and more
Potential Business Impact:
Creates simple stories for AI to learn from.
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we control story characteristics at scale and induce syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared with models trained on the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier for the fewest-parameter language model that outputs grammatical natural language.
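To make the prompt-parameterization idea concrete, here is a minimal Python sketch of sampling one parameter per abstraction level and filling a prompt template. The parameter vocabularies (THEMES, STYLES, GRAMMAR) and the template wording are hypothetical stand-ins for illustration, not the paper's actual generation pipeline.

```python
import random

# Hypothetical controlled vocabularies, one per abstraction level.
# The actual SimpleStories pipeline defines its own parameter sets.
THEMES = ["friendship", "courage", "curiosity"]              # semantic level
STYLES = ["fairy tale", "adventure", "diary entry"]          # narrative level
GRAMMAR = ["past tense", "present tense", "dialogue-heavy"]  # syntactic level

TEMPLATE = (
    "Write a short story in simple language for young readers. "
    "Theme: {theme}. Style: {style}. Grammatical feature: {grammar}."
)

def sample_prompt(rng: random.Random) -> str:
    """Draw one value per abstraction level and fill the template."""
    return TEMPLATE.format(
        theme=rng.choice(THEMES),
        style=rng.choice(STYLES),
        grammar=rng.choice(GRAMMAR),
    )

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducible sampling
    for _ in range(3):
        print(sample_prompt(rng))
```

Because each level is sampled independently, the number of distinct prompt configurations grows multiplicatively with the vocabulary sizes, which is one plausible way such a scheme yields syntactic and semantic diversity at scale.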
Similar Papers
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Computation and Language
Teaches computers to tell moral stories.
Less is More: Adaptive Coverage for Synthetic Training Data
Machine Learning (CS)
Makes AI learn better with less fake data.
SRS-Stories: Vocabulary-constrained multilingual story generation for language learning
Computation and Language
Teaches new words by making fun, personalized stories.