Huawei has recently made further progress in MoE model training, officially launching Pangu Ultra MoE, a new model with up to 718 billion parameters. This near-trillion-parameter MoE model was trained end to end on the Ascend AI computing platform. Huawei has also released a technical report covering the Pangu Ultra MoE architecture and training methods, disclosing many technical details and demonstrating the leap in super-large-scale MoE training performance achievable on Ascend.

Training an ultra-large, highly sparse MoE model is challenging, and training stability is often hard to guarantee. To address this, the Pangu team introduced innovations in both model architecture and training methods, completing a full end-to-end training run of the near-trillion-parameter MoE model on the Ascend platform. On the architecture side, the team proposed the Depth-Scaled Sandwich-Norm (DSSN) stable architecture and the TinyInit initialization method (a rough sketch of both appears below), achieving long-term stable training on more than 18 TB of data on Ascend. They also introduced an EP-group load-balancing loss (also sketched below), which keeps the load well balanced across experts while still allowing individual experts to specialize by domain. In addition, Pangu Ultra MoE adopts the industry-leading MLA (Multi-head Latent Attention) and MTP (Multi-Token Prediction) architectures and uses a dropless training strategy in both pre-training and post-training, striking a strong balance between model quality and efficiency at ultra-large MoE scale.

On the training-method side, the Huawei team disclosed for the first time the key techniques for efficiently running reinforcement learning (RL) post-training of a high-sparsity MoE model on Ascend CloudMatrix 384 supernodes, bringing RL post-training into the supernode-cluster era. Building on the pre-training system acceleration techniques released in early May, the team completed another round of iterative upgrades in under a month, including: an adaptive pipeline overlap strategy suited to Ascend hardware, which further optimizes operator execution order, reduces host-bound overhead, and improves the hiding of EP communication; adaptive memory optimization strategies; attention load balancing across data-parallel (DP) ranks via data reordering; and Ascend-friendly operator optimizations. Together these raised pre-training MFU (Model FLOPs Utilization) on a ten-thousand-card cluster from 30% to 41% (an illustrative calculation follows below).

Separately, the recently released Pangu Pro MoE model, with only 72 billion total parameters and 16 billion activated parameters, delivers excellent performance through an innovative dynamically activated expert network design, rivaling models at the hundred-billion-parameter scale. On the authoritative SuperCLUE large-model leaderboard published in May 2025, it ranked first among domestic models with fewer than 100 billion parameters.

The release of Huawei's Pangu Ultra MoE and Pangu Pro MoE models shows that Huawei has not only completed a fully self-reliant, end-to-end training practice built on domestic compute and domestic models, but has also achieved industry-leading performance in its cluster training system. The self-innovation capability of domestic AI infrastructure has thus been further validated, offering solid reassurance for the development of China's artificial intelligence industry.
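For readers who want a more concrete picture of the DSSN and TinyInit ideas mentioned above, the following PyTorch snippet is a minimal, illustrative sketch only. The sandwich placement of the normalization layers, the depth-dependent scaling of the post-norm gain, and the TinyInit standard-deviation formula shown here are assumptions made for illustration; Huawei's technical report defines the actual scheme, which may differ in its constants and details.

```python
import math
import torch
import torch.nn as nn

class DepthScaledSandwichBlock(nn.Module):
    """Illustrative sketch of a Depth-Scaled Sandwich-Norm residual block.

    Assumptions (not taken from the article): normalization is applied both
    before and after the wrapped sublayer ("sandwich"), the post-norm gain is
    initialized inversely with total depth, and TinyInit draws sublayer
    weights with a std that shrinks with hidden size and depth.
    """
    def __init__(self, sublayer: nn.Module, hidden_size: int, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.pre_norm = nn.LayerNorm(hidden_size)   # RMSNorm in most modern LLMs
        self.post_norm = nn.LayerNorm(hidden_size)
        # Depth-scaled gain: deeper stacks start closer to an identity
        # residual branch, the usual motivation for this family of
        # stability tricks (hypothetical constant).
        nn.init.constant_(self.post_norm.weight, 1.0 / math.sqrt(2 * num_layers))
        # TinyInit-style small initialization (hypothetical constant).
        std = math.sqrt(2.0 / (5 * hidden_size * num_layers))
        for p in self.sublayer.parameters():
            if p.dim() >= 2:
                nn.init.normal_(p, mean=0.0, std=std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + post_norm(sublayer(pre_norm(x)))  -- the "sandwich" pattern
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```

The design intent behind such schemes is that each residual branch contributes very little at initialization, so activations and gradients stay well scaled even in very deep stacks over very long training runs.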
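The EP-group load-balancing loss can likewise be sketched. The snippet below assumes a Switch-Transformer-style auxiliary balance loss that is aggregated over expert-parallel (EP) groups rather than individual experts; the actual loss used in Pangu Ultra MoE is described in Huawei's technical report and may differ in detail.

```python
import torch

def ep_group_balance_loss(router_probs: torch.Tensor,
                          expert_indices: torch.Tensor,
                          num_experts: int,
                          experts_per_ep_group: int) -> torch.Tensor:
    """Auxiliary balance loss aggregated at the expert-parallel-group level.

    router_probs:         (tokens, num_experts) softmax output of the router.
    expert_indices:       (tokens, top_k) experts actually chosen per token.
    experts_per_ep_group: experts hosted by each EP rank (assumed contiguous
                          layout: experts [0, g) on rank 0, [g, 2g) on rank 1, ...).
    """
    num_groups = num_experts // experts_per_ep_group

    # Fraction of routed token slots that land on each expert.
    dispatch = torch.zeros_like(router_probs)
    dispatch.scatter_(1, expert_indices, 1.0)
    tokens_per_expert = dispatch.sum(dim=0) / dispatch.sum()

    # Mean routing probability the router assigns to each expert.
    prob_per_expert = router_probs.mean(dim=0)

    # Collapse per-expert statistics into per-group statistics, so only
    # imbalance *across* EP groups (devices) is penalized; experts within
    # a group remain free to specialize on particular domains.
    f_group = tokens_per_expert.view(num_groups, experts_per_ep_group).sum(dim=1)
    p_group = prob_per_expert.view(num_groups, experts_per_ep_group).sum(dim=1)

    # Switch-style product form: minimized when both are uniform over groups.
    return num_groups * torch.sum(f_group * p_group)
```

Balancing at the group (device) level rather than per expert is one plausible way to reconcile hardware utilization with expert specialization, which is exactly the trade-off the article highlights.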
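Finally, the reported MFU gain is easy to put in perspective. MFU (Model FLOPs Utilization) is the ratio of the FLOPs a training run actually spends on the model's forward and backward computation to the cluster's theoretical peak. The small calculation below uses only the 30%, 41%, and ten-thousand-card figures from the article; per-chip peak throughput is left as a normalized placeholder.

```python
# MFU = useful model FLOPs per second / theoretical peak FLOPs per second.
# Only the 30% -> 41% figures and the 10,000-card scale come from the
# article; per-chip peak throughput is a normalized placeholder.

peak_per_chip = 1.0          # normalized peak FLOPs/s of one NPU (placeholder)
num_chips = 10_000           # "ten-thousand-card cluster"
cluster_peak = peak_per_chip * num_chips

mfu_before, mfu_after = 0.30, 0.41
useful_before = mfu_before * cluster_peak   # 3,000 "chip-equivalents" of work
useful_after = mfu_after * cluster_peak     # 4,100 "chip-equivalents" of work

# On fixed hardware, end-to-end training throughput scales with MFU:
print(f"Relative throughput gain: {useful_after / useful_before:.2f}x")  # ~1.37x
```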
Original Source: https://www.toutiao.com/article/7510149512040677888/ Disclaimer: The article solely represents the author's personal views.