Inference deployment has become the top priority for putting large models into production.

From Google's introduction of the Transformer, now the most widely used neural network architecture in artificial intelligence, to the sudden popularity of DeepSeek V3/R1 during the 2025 Spring Festival, the focus around ultra-large MoE models has gradually shifted from training and development toward supporting inference and deploying applications.

The inference scenario is the "litmus test" of a large model's cognitive ability and the key to its commercial viability. From racing to be the first to launch DeepSeek models to the API-service price war, in an era where inference reigns supreme, whoever maximizes the computational efficiency of inference deployment will be the one to truly commercialize large models.

Mathematics Supplements Physics, Maximizing Computational Efficiency

"Mathematics Supplements Physics," typically refers to using mathematical theories, algorithms, and modeling methods to overcome limitations in traditional physical device development regarding complex system analysis, large-scale computing, or multi-field coupling problems. Huawei's Rotating Chairman, Meng Wanzhou, mentioned in her New Year's address in 2025:

"Huawei's more than a dozen labs, together with engineers from partner companies, formed a 'miscellaneous team.' In the face of severe engineering challenges posed by the Tiancheng AI cluster system and single-chip performance, they creatively applied ideas such as mathematics supplementing physics, non-Moore supplementing Moore, and system supplementing single points. They broke through limits in fields like heat dissipation, power supply, high-speed, high-density, and reliability of large chips on boards."

Huawei's technical team has likewise organized its inference optimization for ultra-large MoE models around the idea of mathematics supplementing physics. By fully exploiting equivalent mathematical transformations, that is, algebraic rearrangement, logical conversion, or structural reconstruction that improves computational efficiency without changing the essential properties of the underlying mathematical objects, the team has pushed the computational efficiency of its hardware clusters to the limit. The work includes framework-side optimizations that extend from individual points to the whole system, the general-purpose FlashComm communication optimization that turns the mathematically optimal formulation into the physically optimal implementation, a general fine-grained masking technique that turns serial computation into four-stream concurrency, an optimal Ascend MLA implementation that replaces multiplication with addition, and a series of hardware-aware innovative operators. These core technologies are being fully disclosed for the first time in a series of technical reports.

Open Source Sharing, Building a Lasting Open Collaboration Ecosystem

Building the Ascend ecosystem is not something accomplished overnight. Beyond sharing the technical approaches through these reports, the code behind the core technologies will be progressively open-sourced within a month. Please follow https://gitcode.com/ascend-tribe/ascend-inference-cluster for continuous updates.

While sharing these technical ideas with the industry, we are also using open source to build a lasting, open collaboration ecosystem. Making Ascend-affine technical capabilities genuinely usable through these open-source projects demonstrates Huawei's determination to build an open ecosystem. It gives experts who want to try Ascend the confidence to invest for the long term, and gives active contributors the confidence to keep contributing, so that together the Ascend ecosystem can thrive in China.

Challenges of Inference for Ultra-Large MoE Models

DeepSeek V3, a 671-billion-parameter model with a mixture-of-experts architecture, performs excellently on a wide range of benchmarks. To some extent it represents a new trend in large-model development: architectures co-designed across software and hardware can maximize the capability of the underlying hardware platform and perform well across many tasks, including natural language understanding, code generation, and mathematical reasoning. For simplicity, we refer to large models of this kind, represented by DeepSeek V3, as ultra-large MoE models.

Despite the outstanding performance, the openly available model weights, and numerous tooling projects such as DeepEP, enterprises that want to deploy the full version of an ultra-large MoE model still face multiple challenges:

First, the required hardware deployment scale is far larger. Every chat with a large model exercises its inference path, and because of their sheer size these models can no longer be served the way smaller models are, on a single machine with one or a few cards. Hardware clusters have become the baseline for running "full-strength" ultra-large MoE models.

Second, the massive model size places high demands on inference efficiency. The large number of experts makes efficient use of hardware memory difficult, and well-designed distributed parallelism and communication strategies are required to run so many experts effectively on a hardware cluster.

Third, the many architectural innovations in ultra-large MoE models bring their own deployment challenges. For example, Multi-head Latent Attention (MLA) compresses the key-value pairs of the original attention mechanism into a smaller latent vector space through projection matrices, which creates new challenges for operator optimization: intermediate variables are inflated and the amount of vector computation rises sharply, placing new requirements on hardware acceleration.
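For readers unfamiliar with the mechanism, the snippet below is a minimal sketch of the low-rank key-value compression idea behind MLA. The dimensions and projection names are illustrative assumptions, not DeepSeek's actual configuration.

```python
import torch

# Minimal sketch of low-rank key-value compression in the spirit of MLA.
# Dimensions and projection names are illustrative, not DeepSeek's actual ones.
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

W_dkv = torch.randn(d_model, d_latent) / d_model ** 0.5           # down-projection (compress)
W_uk = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # up-projection to keys
W_uv = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # up-projection to values

x = torch.randn(4, 32, d_model)        # (batch, seq, hidden)
c_kv = x @ W_dkv                       # only this latent is cached: (batch, seq, d_latent)

# At attention time the latent is expanded back into per-head keys/values,
# which is where the extra intermediate variables and vector work come from.
k = (c_kv @ W_uk).view(4, 32, n_heads, d_head)
v = (c_kv @ W_uv).view(4, 32, n_heads, d_head)

# The KV cache stores d_latent floats per token instead of
# 2 * n_heads * d_head, trading memory for additional computation.
print(c_kv.shape, k.shape, v.shape)
```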

Ascend Empowered Innovations for Large Model Cluster Inference

To address the deployment problems described above, we propose a set of affinity optimization strategies based on Ascend hardware and its networking, forming a complete large-scale expert-parallel solution for clusters.

Ascend servers come in various configurations and models. We deploy on two recently released models: the CloudMatrix 384 Super Node and the Atlas 800I A2 inference server. To decouple the time-to-first-token constraint of the prefill stage from the per-token latency constraint of the decode stage, we adopt a prefill-decode (PD) separated deployment.

On the framework side, we build on the vLLM framework, adapting its DP and EP parallel strategies to Ascend servers. We use prefill scheduling buckets together with Lingqu interconnect and hierarchical transmission to reduce scheduling overhead, and we optimize request dispatch, scheduling policy, connection establishment, and front- and back-end processing in the framework to reach optimal system performance.
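The article does not spell out how the prefill scheduling buckets work; the sketch below is one plausible interpretation, offered only as an illustration: queued prefill requests are grouped by prompt-length bucket so that batches contain similarly sized prompts and the scheduler decides per bucket rather than per request. The bucket edges, batching policy, and function names are assumptions.

```python
from collections import defaultdict

# Illustrative sketch of length-bucketed prefill scheduling (an assumption,
# not the framework's actual implementation).
BUCKET_EDGES = [256, 512, 1024, 2048, 4096]

def bucket_of(prompt_len: int) -> int:
    """Return the smallest bucket edge that fits the prompt length."""
    for edge in BUCKET_EDGES:
        if prompt_len <= edge:
            return edge
    return BUCKET_EDGES[-1]

def schedule_prefill(requests, max_tokens_per_batch=8192):
    """requests: list of (request_id, prompt_len); returns batches of request ids."""
    buckets = defaultdict(list)
    for rid, plen in requests:
        buckets[bucket_of(plen)].append((rid, plen))

    batches = []
    for edge in sorted(buckets):
        batch, used = [], 0
        for rid, plen in buckets[edge]:
            if used + plen > max_tokens_per_batch and batch:
                batches.append(batch)       # flush when the token budget is full
                batch, used = [], 0
            batch.append(rid)
            used += plen
        if batch:
            batches.append(batch)
    return batches

print(schedule_prefill([(1, 100), (2, 900), (3, 300), (4, 4000)]))
```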

For the model, we adopt an A8W8C16 quantization strategy: activations (A8) and weights (W8) use the INT8 data type, while the cache (C16) is kept in BF16. Because the two server models differ in positioning and configuration, and above all differ greatly in network configuration, the detailed deployment plans differ accordingly.
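The snippet below only illustrates the data types named by A8W8C16: INT8 activations and weights with a BF16 cache. The symmetric per-tensor scaling, calibration, and shapes are illustrative assumptions, not the deployed quantization scheme.

```python
import torch

def quantize_int8(t: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns int8 values and a scale."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

x = torch.randn(16, 1024)              # activations -> A8
w = torch.randn(1024, 1024)            # weights     -> W8
xq, x_scale = quantize_int8(x)
wq, w_scale = quantize_int8(w)

# INT8 matmul accumulated in int32, then dequantized back to float.
y = (xq.to(torch.int32) @ wq.to(torch.int32)).to(torch.float32) * (x_scale * w_scale)

kv_cache = torch.randn(16, 128).to(torch.bfloat16)   # C16: cache kept in BF16
print(y.shape, kv_cache.dtype)
```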

For the CloudMatrix 384 Super Node, its unique networking provides powerful advantages. According to DeepSeek's paper, the decode phase is heavily communication-bound; with MicroBatch technology, almost all other computation can be hidden behind the communication. The CloudMatrix 384 interconnect significantly reduces communication latency and further unleashes the compute power of the Ascend chips.

For the Super Node we therefore deploy with large-scale EP parallelism: 16 cards for prefill and 144 cards for decode, of which 128 cards host the routing experts and 16 cards host the shared experts in DP mode; the MLA part is also deployed with DP. The Super Node can reach very high throughput, but several factors keep parts of the system from ideal linear scaling: latency constraints, bandwidth contention and startup overhead degrade performance, framework and scheduling overhead add extra latency, and imbalance in the MLA sequence load and the MoE expert load degrades it further. Taking all of these factors together, the current decode throughput reaches 1920 tokens/s per card under a 50 ms latency guarantee.
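As a rough, illustrative sanity check on these figures (a back-of-the-envelope calculation, not from the article): a 50 ms per-token decode budget corresponds to about 20 tokens/s for a single user, so 1920 tokens/s per card implies on the order of 96 concurrent decode streams per card.

```python
# Back-of-the-envelope check (illustrative only).
per_token_latency_s = 0.050                      # 50 ms decode latency guarantee
per_user_tps = 1 / per_token_latency_s           # ~20 tokens/s per user
per_card_throughput = 1920                       # tokens/s per card (from the article)
concurrent_streams = per_card_throughput / per_user_tps
print(per_user_tps, concurrent_streams)          # 20.0, 96.0
```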

For the Atlas 800I A2 server, which is an 8-card machine, multi-node interconnect is required. Balancing throughput against deployment flexibility, we chose a prefill instance of 2 machines (16 cards) and a decode instance of 4 machines (32 cards). Using fewer cards preserves deployment flexibility and leads to a relatively small-scale EP strategy: each card hosts 8 routing experts and 1 shared expert, the MLA part uses DP parallelism, and communication uses an AllGather scheme. Even with fewer cards, this deployment still achieves considerable throughput. It is worth noting that our communication scheme uses AllGather rather than Dispatch/Combine, which performs better under real workloads. With the various optimization strategies applied, we achieved a per-card throughput of 808 tokens/s under a 100 ms latency constraint.

1. Optimization Technologies on Inference Framework Side

1) API Server Extension Technology

The team proposed an API Server scale-out technology that supports horizontal scaling of API Servers, effectively improving the framework's request-handling capacity, reducing user request latency, and increasing system throughput (QPS). Combined with network-scheme optimization and fully parallel, fully asynchronous front- and back-end processing, it further achieves optimal TTFT and improves the availability and efficiency of the inference service.

2) Load Balancing for MoE Models

The team proposed an efficient load-balancing strategy that significantly improves MoE inference performance through dynamic load balancing, redundant deployment of hot experts, real-time scheduling, and dynamic monitoring.
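To illustrate the hot-expert redundancy idea only, the sketch below replicates the most frequently routed experts and spreads their traffic across the replicas. The replication policy, placement, and function names are assumptions, not the team's actual scheduler.

```python
import random
from collections import Counter

def plan_replicas(routing_history, n_redundant_slots):
    """Give an extra replica to the experts that receive the most tokens."""
    counts = Counter(routing_history)
    hot = [e for e, _ in counts.most_common(n_redundant_slots)]
    replicas = {e: 1 for e in counts}           # every expert has one base copy
    for e in hot:
        replicas[e] += 1                        # hot experts get one extra copy
    return replicas

def pick_replica(expert_id, replicas):
    """Random choice among an expert's replicas at dispatch time."""
    return (expert_id, random.randrange(replicas.get(expert_id, 1)))

history = [random.choice([0, 1, 2, 3, 3, 3, 7]) for _ in range(1000)]
replicas = plan_replicas(history, n_redundant_slots=2)
print(replicas, pick_replica(3, replicas))
```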

2. FusionSpec Inference Speculative Acceleration Technology

In practice, speculative inference has mostly targeted small-batch, low-latency scenarios; applying it efficiently in high-throughput scenarios and maximizing its benefit has become a pressing technical challenge.

Speculative inference increases the computational density of the decoding phase, which naturally matches Ascend's high compute-to-bandwidth ratio. To fully exploit this advantage and achieve high throughput in low-latency, high-concurrency scenarios, we proposed FusionSpec, a speculative inference engine that deeply optimizes MTP on Ascend to improve inference performance (a generic sketch of the draft-and-verify loop appears after the list below):

  • In the inference flow, the speculative model is placed after the main model, directly consuming the main model's output and reusing its control parameters, which significantly reduces framework overhead and fits the PD-separated deployment scenario.
  • To further exploit Ascend's computational capability when speculation is enabled and to reduce NPU idle time, we optimized the speculative inference framework, the sampling (sampler) operations, and the multi-head latent attention (MLA) computation.
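The sketch below shows the generic draft-and-verify loop of speculative decoding with toy stand-ins for the draft (MTP-style) and main models; greedy verification is used as a simplified variant. It illustrates the general technique only, not FusionSpec's scheduling or operator work.

```python
import torch

VOCAB = 100

def toy_logits(prefix, seed_shift=0):
    """Deterministic toy 'model': logits depend only on the last token."""
    torch.manual_seed(prefix[-1] + seed_shift)
    return torch.randn(VOCAB)

def draft_propose(prefix, k=3):
    """Draft model proposes k tokens greedily (seed_shift makes it imperfect)."""
    out = list(prefix)
    for _ in range(k):
        out.append(int(torch.argmax(toy_logits(out, seed_shift=7))))
    return out[len(prefix):]

def main_verify(prefix, proposal):
    """Main model accepts the longest agreeing prefix of the proposal."""
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        main_tok = int(torch.argmax(toy_logits(ctx)))
        if main_tok != tok:
            accepted.append(main_tok)   # replace the first mismatch with the main model's token
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

prefix = [1, 2, 3]
proposal = draft_propose(prefix, k=3)
print("proposal:", proposal, "accepted:", main_verify(prefix, proposal))
```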

3. FlashComm Communication Optimization Technology

FlashComm: the mainstream tensor-parallel (TP) scheme based on AllReduce suffers from frequent communication, large data volume, and a high bit width for the communicated data; in addition, the residual connection and normalization computed after AllReduce are redundantly repeated on every card, so the parallel capability of multiple cards is not fully used. We therefore propose the FlashComm communication scheme: keeping the collective-communication logic equivalent, we replace the AllReduce operators in tensor parallelism for the first three dense MLP layers of the DeepSeek network and rearrange the positions of the communication operators, so that the data communicated is low-bit and low-dimensional. This effectively reduces communication volume and latency while eliminating the redundant computation in the network.
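The single-process simulation below only illustrates why such rearrangement can preserve results: an AllReduce(sum) is mathematically equivalent to a ReduceScatter followed by an AllGather, and the data moved in between is a fraction per card. It is a generic identity, not FlashComm's exact transformation of the DeepSeek layers.

```python
import numpy as np

n_cards, dim = 4, 8
per_card = [np.random.randn(dim) for _ in range(n_cards)]

# Reference: AllReduce gives every card the full element-wise sum.
allreduce = sum(per_card)

# ReduceScatter: each card ends up holding one 1/n_cards slice of the sum ...
slices = np.array_split(allreduce, n_cards)
reduce_scatter = [slices[r] for r in range(n_cards)]   # card r holds slice r

# ... AllGather: cards exchange their slices to reconstruct the full sum.
allgather = np.concatenate(reduce_scatter)

assert np.allclose(allgather, allreduce)
print("equivalent:", np.allclose(allgather, allreduce))
```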

Intra-layer Parallel Conversion: building on FlashComm, and to further reduce communication-operator latency, we redesign the parallel strategy of the MLA layer in the prefill phase, converting flexibly between tensor parallelism (TP) and data parallelism (DP). This eliminates the summation across cards within a node and fully exploits the low data dimensionality and quantization properties of the network, greatly reducing inter-node communication volume and thus communication latency.

Compute-Communication Concurrency: Ascend chips provide mechanisms for overlapping computation with communication. In the MoE layers, AllGather is used to gather each card's token features before selecting and computing the active experts. In our scheme the gate function computes first and then communicates for aggregation, and the shared experts use DP, so the gate computation, the shared-expert computation, and the AllGather feature aggregation have no mutual dependencies. We use Ascend's multi-stream mechanism to run these three parts concurrently, maximizing inference performance. The deployment can be refined further as needed; for example, to save memory the shared experts can use TP within a machine and DP across machines, and the shared-expert computation can still run concurrently with the intra-machine AllGather or with inter-machine feature transfers.
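As an analogy for the multi-stream overlap (using PyTorch CUDA streams as a stand-in for Ascend's multi-stream mechanism; this is not Ascend/NPU code), the sketch below launches three mutually independent pieces of work on separate streams so the hardware can overlap them. The toy gate, shared-expert, and gather functions are placeholders.

```python
import torch

def gate(x, w):          return x @ w                 # routing scores (toy)
def shared_expert(x, w): return torch.relu(x @ w)     # shared-expert FFN (toy)
def gather(x):           return x.repeat(2, 1)        # stand-in for AllGather

if torch.cuda.is_available():
    dev = "cuda"
    x = torch.randn(1024, 1024, device=dev)
    w_gate, w_shared = (torch.randn(1024, 1024, device=dev) for _ in range(2))
    s1, s2, s3 = (torch.cuda.Stream() for _ in range(3))
    with torch.cuda.stream(s1):
        scores = gate(x, w_gate)
    with torch.cuda.stream(s2):
        shared_out = shared_expert(x, w_shared)
    with torch.cuda.stream(s3):
        gathered = gather(x)
    torch.cuda.synchronize()            # join all three streams before using results
    print(scores.shape, shared_out.shape, gathered.shape)
else:
    print("No GPU available; the overlap pattern above is the point of the sketch.")
```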

Communication-Communication Concurrency: Ascend chips also allow two communication operators to run concurrently. When bandwidth utilization is low, running two communications at once hides operator-launch overhead and improves bandwidth utilization. When DeepSeek models perform communications such as AllGather, the Norm and quantization operators can be moved ahead of the AllGather to reduce the amount of data communicated; however, moving the quantization forward means the quantized activations and their scales must be communicated separately, which adds operator-launch overhead. Since the scales are small and use very little bandwidth, we run the activation and scale communications concurrently, hiding the cost of the scale communication without adding overhead to the activation communication.

Communication and Weight Pre-fetching Concurrency: Ascend chips provide a caching mechanism; operators first look for data in the cache and otherwise read from HBM, and cache bandwidth is several times that of HBM. Because HBM bandwidth usage is low while communication operators execute, we can prefetch the weights needed by subsequent operators into the cache during communication, reducing weight-transfer overhead in the following computation. Ascend chips also support flexibly limiting the prefetch bandwidth, so the impact on communication performance is minimal. For DeepSeek models, prefetching the MLA weight matrices and KV cache at the tail of the MoE ReduceScatter improves the computational performance of the MLA part.

4. Ascend-Affinity Innovative Operators

1) MLA Operator Optimization:

Attention Operator: compared with traditional attention operators (such as MHA and GQA, which are mainly bandwidth-bound), MLA's intermediate-variable expansion and sharply increased computational load pose new challenges for operator optimization. Based on the architectural characteristics of Ascend processors, we have algorithmically restructured and hardware-affinely optimized the FA operators in the MLA scenario.

  • We propose the AMLA (Ascend MLA) algorithm, which converts multiplications into equivalent additions by parsing the binary encoding of floating-point numbers and using atomic accumulation, so that the output O is updated directly in Global Memory without passing through the Vector cores, greatly reducing repeated movement of intermediate variables (a small numerical illustration of the exponent-bit idea follows this list).
  • We carefully plan the use of the L1 cache to minimize repeated data movement.
  • In the engineering implementation, we optimize the computation flow to raise the L2 cache hit rate and use strategies such as K-buffer pipelining, so that Cube computation and Vector computation hide each other and overall operator performance improves.
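The snippet below only demonstrates the IEEE 754 fact underlying the exponent-bit idea: multiplying a float32 by a power of two is equivalent to adding an integer to the exponent field of its bit pattern (ignoring overflow, subnormals, and zero). How AMLA schedules this via atomic accumulation on Ascend's Global Memory path is not shown here.

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a float32 as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def mul_pow2_by_int_add(x: float, k: int) -> float:
    """Multiply float32 x by 2**k via an integer addition to its exponent bits."""
    return bits_float(float_bits(x) + (k << 23))   # 23 = float32 mantissa width

x = 3.1415927
print(mul_pow2_by_int_add(x, 4), x * 2 ** 4)       # both approximately 50.26548
```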

Meanwhile, although the current model implementation does not use KV-cache quantization, we have deeply restructured and optimized the MLA attention computation for the scenarios of INT8-quantized KV cache only and fully INT8-quantized Q/K/V.

Preceding MLA Operators: for the complex operators that precede MLA, we adopt different optimization strategies in the prefill and decode phases:

  • In the prefill phase, we use dual-stream concurrency to hide pipeline stages and extend the FA operator to support more input/output formats, eliminating operators that exist purely for memory movement.
  • In the decode phase, we absorb the weights, deeply fuse the preceding operators into the MLAProlog operator, and comprehensively optimize it for the Ascend hardware architecture.

Specific optimization measures include: reducing pipeline bubbles through weight prefetching; tiling strategies that minimize data movement and maximize bandwidth; reducing instruction dependencies and stalls through computational decoupling; eliminating full-core synchronization overhead through local computation fusion; and using Ascend custom instructions to compress the ICache and avoid issue-queue stalls.

2) MoE Operator Optimization

Dispatch/Combine General Fusion Operator: in EP deployment, the MoE experts are distributed across cards in a large communication domain, and every token must be dispatched to the card holding its experts. The original implementation uses InitialRouting to reorder all tokens by expert and then exchanges tokens with the AllToAll and AllToAllv communication operators. In large communication domains this approach suffers from frequent communication and heavy card-synchronization overhead, which limits end-to-end latency across the network.
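For reference, the baseline InitialRouting-style step reorders tokens so that all tokens routed to the same expert are contiguous before the AllToAll(v) exchange. The single-device sketch below simulates only that reordering; top-1 routing and the shapes are illustrative, and the fused MoeDistributeDispatch/Combine operators replace this reorder-then-communicate pattern rather than implement it.

```python
import torch

n_tokens, hidden, n_experts = 16, 8, 4
tokens = torch.randn(n_tokens, hidden)
expert_ids = torch.randint(0, n_experts, (n_tokens,))       # top-1 routing result

# Group tokens by expert so each expert's tokens are contiguous.
sorted_ids, order = torch.sort(expert_ids, stable=True)
sorted_tokens = tokens[order]
counts = torch.bincount(expert_ids, minlength=n_experts)    # send-counts for AllToAllv

# counts[e] tokens starting at offset sum(counts[:e]) would go to the card
# hosting expert e; the inverse permutation restores order after Combine.
inverse = torch.argsort(order)
print(counts.tolist(), torch.allclose(sorted_tokens[inverse], tokens))
```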

We therefore propose two general fusion operators, MoeDistributeDispatch and MoeDistributeCombine: computation and transmission are decomposed into token-granular units and pipelined so that communication and computation execute in parallel; memory-semantic communication writes data directly into shared memory on the target card, reducing local copies and wait-for-data overhead; and a local memory selection-and-copy mechanism reduces the number of data transfers and the card-synchronization overhead.

SMTurbo-CPP Operator: to address the low transmission efficiency of small payloads in the large MoE communication domain, we propose SMTurbo-Concurrent Push and Pull (SMTurbo-CPP): the AllToAll(v) communication operators are optimized at the memory-semantic level, fully exploiting the hardware's concurrency, and techniques such as mixed read/write, aggregated pipelines, and batch detection improve per-thread memory-access efficiency and throughput, significantly reducing latency in the dispatch and combine scenarios.

Fine-grained Hierarchical Pipelining Algorithm: on Atlas A2 series products, HCCL supports a fine-grained hierarchical pipelining algorithm that greatly improves the execution efficiency of collective communication operators such as AllGather, ReduceScatter, and AllToAll in the cluster. The algorithm exploits the characteristics of the A2 networking to execute intra-server and inter-server communication concurrently, improving bandwidth utilization.

Performance Results

In April 2025, SiliconFlow, in collaboration with Huawei Cloud, officially launched DeepSeek-R1 on the CloudMatrix 384 Super Node Ascend cloud service with the high-performance inference framework SiliconLLM. The service achieves a breakthrough single-card decode throughput of 1920 tokens/s while keeping single-user speed at the 20 TPS level, comparable to H100 deployments. In addition, validation on mainstream test sets and in large-scale online blind tests shows that the accuracy of DeepSeek-R1 deployed on Ascend compute remains consistent with the official DeepSeek model.

Conclusion

In the tide of booming large-model technology, Huawei, with its strong computing power and innovative architectural design, is injecting powerful momentum into industry innovation. Huawei has released the key technologies of its Ascend inference series and will soon open-source the code, building an open, win-win developer ecosystem to promote the innovative development of large-model technology and contribute core strength to China's homegrown artificial intelligence ecosystem.

Original article: https://www.toutiao.com/article/7506888189970268706/

Disclaimer: This article solely represents the author's views.