Parameterized Synthetic Text Generation with SimpleStories

Published: April 12, 2025 | arXiv ID: 2504.09184v3

By: Lennart Finke, Chandan Sreedhara, Thomas Dooms, and more

Potential Business Impact:

Provides simple synthetic stories for training and studying small language models.

Business Areas:
Text Analytics, Data and Analytics, Software

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier regarding the fewest-parameter language model that outputs grammatical natural language.
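To make the idea of parameterized prompts concrete, here is a minimal sketch of how story prompts might be assembled by sampling one value per parameter. The parameter pools (THEMES, TOPICS, STYLES, FEATURES), the template wording, and the function names are hypothetical illustrations, not the prompts or code used for SimpleStories.

```python
import random

# Hypothetical parameter pools, chosen only for illustration.
# The actual SimpleStories parameters are not listed in this summary.
THEMES = ["friendship", "courage", "curiosity"]           # abstract level
TOPICS = ["a lost kite", "a talking river", "a rainy day"]  # concrete level
STYLES = ["a fairy tale", "a diary entry", "a bedtime story"]
FEATURES = ["dialogue", "a surprise ending", "a moral"]


def build_prompt(rng, language="English"):
    """Sample one value per parameter and fill a fixed prompt template."""
    theme = rng.choice(THEMES)
    topic = rng.choice(TOPICS)
    style = rng.choice(STYLES)
    feature = rng.choice(FEATURES)
    return (
        f"Write a short story in simple {language}, in the style of {style}. "
        f"The story is about {topic} and explores the theme of {theme}. "
        f"Include {feature}."
    )


def build_prompts(n, seed=0, language="English"):
    """Build n prompts from a seeded generator so the batch is reproducible."""
    rng = random.Random(seed)
    return [build_prompt(rng, language) for _ in range(n)]


if __name__ == "__main__":
    for prompt in build_prompts(3, seed=42):
        print(prompt)
```

Sampling each parameter independently means the number of distinct prompts grows multiplicatively with the pool sizes (here 3 × 3 × 3 × 3 = 81), which is one plausible way such a scheme could induce diversity at scale.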

Country of Origin
🇨🇭 Switzerland


Page Count
16 pages

Category
Computer Science:
Computation and Language