dots.llm1 Technical Report
By: Bi Huo, Bin Tu, Cheng Qin, and more
Potential Business Impact:
Cuts the cost of training and running large AI models while matching the performance of leading systems.
Mixture-of-Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B of its 142B total parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 matches the performance of Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints taken every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
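To make the sparse-activation idea concrete, here is a minimal sketch of a generic top-k MoE layer: a router scores every expert for each token, only the top-scoring experts run, and their outputs are combined with the normalized router weights, so only a fraction of the layer's parameters is active per token. The class name, hyperparameters (hidden size, expert count, top-k), and the use of PyTorch are illustrative assumptions, not dots.llm1's actual architecture or code.

```python
# Illustrative top-k MoE routing, not the dots.llm1 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router produces one score per expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (n_tokens, d_model)
        logits = self.router(x)                    # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoELayer()
    tokens = torch.randn(8, 1024)
    print(layer(tokens).shape)                     # torch.Size([8, 1024])
```

With 16 experts and top-2 routing in this toy configuration, roughly one-eighth of the expert parameters are exercised per token, mirroring in miniature the 14B-of-142B activation ratio described in the abstract.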
Similar Papers
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Machine Learning (CS)
Trains big AI models cheaper and faster.
ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning
Machine Learning (CS)
Makes smart computer programs smaller and faster.
Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?
Computation and Language
Makes smart computer programs learn more with same effort.