Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes
By: Junghwan Lim, Sungmin Lee, Dongseok Kim, et al.
We introduce Motif-2-12.7B-Reasoning, a 12.7B-parameter language model designed to narrow the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. To address the common challenges of model collapse and training instability during reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system-, data-, and algorithm-level optimizations. Our approach pairs memory-efficient infrastructure for 64K-token contexts (hybrid parallelism and kernel-level optimizations) with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. We further detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results show that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
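To make the RLFT recipe more concrete, here is a minimal sketch of difficulty-aware data filtering, assuming a binary verifier and a sampled estimate of each prompt's solve rate: prompts that the current policy almost always or almost never solves carry little learning signal and are dropped. The function names, thresholds, and sample counts below (estimate_pass_rate, min_rate, max_rate, n_samples) are illustrative assumptions, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prompt:
    text: str
    answer: str


def estimate_pass_rate(prompt: Prompt,
                       sample_and_verify: Callable[[Prompt], bool],
                       n_samples: int = 8) -> float:
    """Estimate the solve rate by sampling n rollouts and verifying each one."""
    successes = sum(sample_and_verify(prompt) for _ in range(n_samples))
    return successes / n_samples


def difficulty_filter(prompts: List[Prompt],
                      sample_and_verify: Callable[[Prompt], bool],
                      min_rate: float = 0.125,
                      max_rate: float = 0.875) -> List[Prompt]:
    """Keep prompts that are neither trivially easy nor near-unsolvable,
    since both extremes yield almost no advantage signal during RL fine-tuning."""
    kept = []
    for p in prompts:
        rate = estimate_pass_rate(p, sample_and_verify)
        if min_rate <= rate <= max_rate:
            kept.append(p)
    return kept


if __name__ == "__main__":
    # Toy stand-in for "generate a rollout and check it with a verifier".
    def toy_sample_and_verify(prompt: Prompt) -> bool:
        return random.random() < (0.95 if "easy" in prompt.text else 0.4)

    pool = [Prompt("easy: 2+2", "4"), Prompt("hard: prove the claim", "...")]
    print([p.text for p in difficulty_filter(pool, toy_sample_and_verify)])
```

The abstract's "mixed-policy trajectory reuse" suggests reusing rollouts generated by a slightly stale behavior policy; one standard way to keep such reuse stable is a clipped, importance-weighted surrogate loss. The sketch below shows that generic mechanism under this assumption; the paper's exact objective is not specified in this summary.

```python
import torch


def mixed_policy_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss that stays well-behaved when trajectories are
    reused from a stale behavior policy: exp(logp_new - logp_old) is the
    importance ratio correcting for the off-policy mismatch."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In this kind of setup, the clipping range and the number of optimizer steps for which a batch of trajectories is reused are the main knobs trading sample efficiency against stability.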
Similar Papers
RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
Artificial Intelligence
Robots learn to follow complex instructions better.
Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models
Machine Learning (CS)
Helps small AI learn to think better.