Reinforcement Pre-Training

Published: June 9, 2025 | arXiv ID: 2506.08007v1

By: Qingxiu Dong, Li Dong, Yao Tang and more

BigTech Affiliations: Microsoft

Potential Business Impact:

Teaches computers to guess words better.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves language modeling accuracy on next-token prediction. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
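
As a rough illustration of the verifiable reward described in the abstract, the sketch below assumes a binary, exact-match reward: a sampled prediction earns 1.0 if it equals the next token that actually appears in the text, and 0.0 otherwise. The function names, the example context, and the sampled predictions are all hypothetical and not taken from the paper; the reward and training loop the authors actually use may differ.

```python
from typing import List, Sequence


def next_token_reward(predicted: str, target: str) -> float:
    # Assumed binary reward: the corpus itself supplies the ground-truth
    # next token, so no human-annotated answer is needed.
    return 1.0 if predicted == target else 0.0


def score_rollouts(predictions: Sequence[str], target: str) -> List[float]:
    # One reward per sampled prediction for the same context.
    return [next_token_reward(p, target) for p in predictions]


# Hypothetical example: the context is shown for illustration only;
# the reward depends solely on the prediction and the true next token.
context = ["The", "capital", "of", "France", "is"]
target = "Paris"  # next token read directly from the text
sampled = ["Paris", "Lyon", "Paris", "London"]  # hypothetical model outputs

rewards = score_rollouts(sampled, target)
accuracy = sum(rewards) / len(rewards)
print(rewards, accuracy)
```

Because the target comes straight from the text, any corpus can supply rewards of this kind, which is the property the abstract contrasts with domain-specific annotated answers.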

Country of Origin
🇺🇸 United States

Page Count
15 pages

Category
Computer Science:
Computation and Language