KORMo: Korean Open Reasoning Model for Everyone
By: Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, and more
Potential Business Impact:
Builds a capable Korean-English language model trained largely on synthetic (machine-generated) data, improving access to strong Korean-language AI.
This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.
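To make the reported corpus composition concrete, the following is a minimal sketch (not the authors' released code) of how a bilingual pretraining mixture with a fixed synthetic share of the Korean portion might be budgeted. Only the 68.74% synthetic ratio comes from the abstract; the total token budget and the English/Korean split used below are hypothetical placeholders.

```python
# Sketch only: splits a token budget into English, natural Korean, and
# synthetic Korean portions. The 68.74% synthetic share of the Korean
# portion is from the abstract; all other numbers are assumptions.

def build_mixture(total_tokens: float, korean_share: float,
                  synthetic_korean_ratio: float) -> dict:
    """Return token counts for each slice of the bilingual pretraining mix."""
    korean_tokens = total_tokens * korean_share
    english_tokens = total_tokens - korean_tokens
    synthetic_korean = korean_tokens * synthetic_korean_ratio
    natural_korean = korean_tokens - synthetic_korean
    return {
        "english": english_tokens,
        "korean_natural": natural_korean,
        "korean_synthetic": synthetic_korean,
    }

if __name__ == "__main__":
    # Hypothetical 1T-token budget and 50/50 language split, for illustration.
    mix = build_mixture(total_tokens=1e12, korean_share=0.5,
                        synthetic_korean_ratio=0.6874)
    for name, tokens in mix.items():
        print(f"{name}: {tokens / 1e9:.1f}B tokens")
```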
Similar Papers
Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
Computation and Language
A benchmark testing whether LLMs can follow multistep reasoning over long Korean narratives.
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Computation and Language
A language-centric omnimodal MoE model that understands and generates text, images, and audio.
KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
Computation and Language
A benchmark measuring how accurately LLMs recall Korean factual knowledge.