Training Report of TeleChat3-MoE
By: Xinzhang Liu, Chao Wang, Zhihao Yang, et al.
TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on an Ascend NPU cluster. This technical report focuses on the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimization, addressing host- and device-bound bottlenecks in large-scale training jobs. These infrastructure advances yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on the Ascend hardware ecosystem.
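As a rough illustration of what operator-level accuracy verification involves, the sketch below compares a low-precision execution of an operator against a high-precision reference under mixed absolute/relative tolerances. The operator choice, precision pairing, and tolerance values are assumptions for illustration only; the report's actual verification tooling and cross-platform comparison procedure are not reproduced here.

```python
# A minimal sketch (not the report's tooling): check a low-precision operator
# result against a high-precision reference within mixed tolerances.
import numpy as np

def verify_operator(op, inputs, atol=1e-3, rtol=1e-3):
    """Run `op` in float64 as the reference and float16 as the 'device'
    precision, then test element-wise closeness. The tolerances and the
    precision pairing are illustrative assumptions."""
    ref = op(*[x.astype(np.float64) for x in inputs])
    test = op(*[x.astype(np.float16) for x in inputs]).astype(np.float64)
    abs_err = np.abs(test - ref)
    passed = bool(np.all(abs_err <= atol + rtol * np.abs(ref)))
    return passed, float(abs_err.max())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((128, 256)).astype(np.float32)
    b = rng.standard_normal((256, 64)).astype(np.float32)
    ok, max_err = verify_operator(np.matmul, [a, b])
    print(f"matmul within tolerance: {ok}, max abs error: {max_err:.3e}")
```

In practice such checks would be run per operator across hardware platforms and parallelism layouts, with end-to-end loss curves compared on top of the per-operator results.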
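The parallelization framework described above combines analytical cost estimation with integer linear programming to select multi-dimensional parallelism configurations. As a minimal sketch of the idea, the snippet below enumerates integer-feasible (TP, PP, DP, EP) combinations for an assumed cluster and model size and scores them with a placeholder cost model; brute-force enumeration stands in for the ILP solve, and every constant, constraint, and cost coefficient is hypothetical rather than taken from the report.

```python
# A minimal sketch of parallelism-configuration search: enumerate integer-feasible
# (TP, PP, DP, EP) degrees and pick the one with the lowest estimated overhead.
# Brute-force enumeration stands in for the ILP solve; all constants below
# (cluster size, memory cap, parameter sizes, cost coefficients) are hypothetical.
from itertools import product

WORLD_SIZE = 4096        # assumed number of devices
NUM_LAYERS = 96          # assumed transformer depth
NUM_EXPERTS = 64         # assumed experts per MoE layer
DENSE_GB = 200.0         # assumed non-expert parameter/optimizer footprint
EXPERT_GB = 1800.0       # assumed expert parameter/optimizer footprint
MEM_CAP_GB = 48.0        # assumed per-device memory budget

def feasible_configs():
    degrees = [d for d in range(1, WORLD_SIZE + 1) if WORLD_SIZE % d == 0]
    for tp, pp, ep in product(degrees, repeat=3):
        if WORLD_SIZE % (tp * pp) != 0:
            continue
        dp = WORLD_SIZE // (tp * pp)
        if dp % ep != 0 or NUM_LAYERS % pp != 0 or NUM_EXPERTS % ep != 0:
            continue
        # Toy memory model: dense weights sharded by TP*PP, experts also by EP.
        mem = DENSE_GB / (tp * pp) + EXPERT_GB / (tp * pp * ep)
        if mem <= MEM_CAP_GB:
            yield tp, pp, dp, ep

def estimated_overhead(tp, pp, dp, ep):
    """Toy analytical cost: TP collectives, pipeline bubbles, and EP all-to-all
    each add overhead; the coefficients are placeholders, not measured values."""
    return 0.02 * (tp - 1) + 0.01 * (pp - 1) + 0.015 * (ep - 1) / ep

candidates = list(feasible_configs())
if not candidates:
    raise SystemExit("no feasible configuration under the toy constraints")
tp, pp, dp, ep = min(candidates, key=lambda c: estimated_overhead(*c))
print(f"selected config: TP={tp} PP={pp} DP={dp} EP={ep}")
```

A real formulation would encode memory, bandwidth, and pipeline-bubble terms as constraints and objectives for an off-the-shelf ILP solver, but the structure of the search, integer feasibility plus an analytical cost estimate, is the same.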