Jin Lei from WAIC

Quantum Bit | Official Account QbitAI

Look around today: which chip can run the full-strength version of DeepSeek? The answer may surprise you. It is not the one you might think, but a domestic GPU.

Because its speed has already reached 100 tokens/s!

That is double the 50 tokens/s of foreign GPUs and several times the 15 tokens/s of other domestic chips.

Run all three side by side, and the gap is even more obvious.

While the domestic GPU in the middle delivers a complete, accurate answer with ease, the two on either side are still deep in thought:

So who is this domestic GPU?

No need to beat around the bush: it is Moore Threads.

But many will ask: how did Moore Threads, founded less than five years ago, achieve such speed?

After Quantum Bit took in the full picture of its "way of computing power," we found the answer far grander and more profound than simply building a faster chip.

An AI Super Factory Has Been Built

Yes. Moore Threads has built an AI Super Factory (AI Foundry) for computing power.

When it comes to chips, the word "Foundry" usually evokes wafer fabs, whose value depends on yield, capacity, and process technology.

However, an AI Super Factory is not a physical plant that produces chips, but a metaphor: a system that "produces" AI models the way a fab produces chips.

The evolution of this AI factory is like a process-node upgrade: not a single technology swapped out, but a systematic, end-to-end transformation.

It requires the entire technical stack to be completely reformed: from the fundamental chip architecture, to the overall cluster architecture design, and then to the software level — how algorithms are optimized to be smarter, and how resource scheduling runs more efficiently. Every link is crucial.

It is precisely this large-scale infrastructure overhaul that truly unleashes AI computing power, enabling cutting-edge AI large models to be "produced" and "iterated" at scale.

A key point to emphasize is that building such a super factory is far from simply stacking thousands of graphics cards together.

It requires the tight coupling and co-evolution of five core elements; none of them can be missing.

The productivity of this AI factory can be summarized by a formula:

AI Factory Productivity = Generality of Accelerated Computing × Effective Compute per Chip × Single-Node Efficiency × Cluster Efficiency × Cluster Stability
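Because the formula is a product, no single strong factor can compensate for a weak one; the weakest link caps the whole factory. A minimal sketch (with made-up numbers, not Moore Threads figures) makes this concrete:

```python
# Productivity is the product of five factors, so any weak link
# drags the whole factory down (illustrative numbers only).
factors = {
    "generality": 0.90,       # share of workloads the GPU can accelerate
    "chip_efficiency": 0.80,  # effective compute per chip vs. peak
    "node_efficiency": 0.95,
    "cluster_efficiency": 0.90,
    "stability": 0.99,        # fraction of time training is uninterrupted
}

def factory_efficiency(f):
    prod = 1.0
    for v in f.values():
        prod *= v
    return prod

balanced = factory_efficiency(factors)

# Halving just one factor halves end-to-end productivity.
weak = dict(factors, stability=0.495)
assert abs(factory_efficiency(weak) - balanced / 2) < 1e-9
```

The multiplicative structure is why the article stresses that all five elements must co-evolve rather than any one being maximized in isolation.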

Moore Threads has built its technological moat around exactly these five elements.

Full-featured GPU: The Foundation of the Super Factory

The foundation of the AI Super Factory is a "full-featured GPU" with strong general-purpose capability, because the history of computing power is, in essence, a history of the full-featured GPU.

From the initial "graphics card" (VGA Card) that could only accelerate 3D graphics, to the "modern graphics processor" that allows developers to create infinite possibilities through open programming interfaces, and then to a general computing platform widely used in supercomputing, deep learning, and blockchain, each leap of the GPU came from the expansion of its generality.

Single-function accelerators, such as early 3D acceleration cards or certain specialized AI chips (ASICs) today, although highly efficient in specific tasks, have poor flexibility and are difficult to program, unable to adapt to the rapidly evolving AI models and the ever-increasing application scenarios.

An AI model may need to process language, understand images, and even simulate the physical world. If the factory's "machine tool" can only handle one task, it will soon be replaced.

That is why Moore Threads has consistently focused on building a true full-featured GPU, one that is both "functionally complete" and "precision complete."

First, "functionally complete": the chip integrates four core engines:

  1. AI computing acceleration engine: Not only can it perform inference, but also training, achieving a unified training and inference capability.
  2. Advanced 3D graphics rendering engine: Supports modern graphics APIs such as DX12, meeting visual computing needs for games, AIGC, and digital twins.
  3. Physical simulation and scientific computing engine: This is a part often overlooked but crucial. Future Agentic AI and spatial intelligence require interaction with the physical world, and strong scientific computing capabilities serve as a bridge between the digital and physical worlds.
  4. Ultra-high-definition video encoding and decoding engine: The computational results of AI ultimately need to be presented to humans through vision and hearing. High-definition, low-latency streaming processing capabilities ensure a good human-computer interaction experience.

Second, "precision complete": full coverage of computing precisions, from FP32 and FP16 to industry-leading FP8 and even lower-precision INT8/INT4. Complete precision support lets developers find the best balance between performance and accuracy for each task.

Mixed-precision training has become standard for large models in particular, and Moore Threads is one of the few platforms in China that offers FP8 training. Together, "full functionality" and "full precision" ensure that Moore Threads' GPU, the factory's "machine tool," can take on all kinds of AI model production orders.
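To see what "trading precision for performance" means in practice, here is a minimal symmetric INT8 quantize/dequantize sketch in plain Python; it illustrates the general technique, not Moore Threads' implementation:

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: x is approximated by q * scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 1.0, -0.97]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# INT8 uses a quarter of FP32 storage, at a small reconstruction error
# bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2
```

Lower precisions like FP8 follow the same idea in hardware, which is why full-precision coverage lets each task pick its own accuracy/throughput trade-off.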

MUSA Unified System Architecture: The "Chief Designer" of the Super Factory

If the full-featured GPU is the machine tool of the factory, then MUSA is the "chief designer" of the entire factory. A superior top-level architecture can determine a company's technical roadmap and development potential for ten years or more.

The core philosophy of MUSA is "One Architecture for Many Applications." It adopts an innovative multi-engine, scalable, and configurable unified system architecture, performing top-level design and unified management of functions such as computation, communication, memory, and scheduling within the GPU.

First, scalability. The MUSA architecture can be quickly trimmed and reconfigured to suit different customers and markets, greatly reducing the R&D cost of new chips.

Second, global resource sharing. Simply put, it connects all hardware resources—such as computing cores, memory, and communication—into a large resource pool and flexibly allocates them through intelligent scheduling.

This directly solves a long-standing problem: single-engine GPUs would often stall when running multiple kinds of tasks. Now all resources are shared and used on demand.
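The pooled-resource idea can be sketched as a toy allocator in which tasks draw from, and return to, one shared pool; the class and numbers below are hypothetical:

```python
# Toy model of pooled GPU resources: tasks draw from one shared pool
# instead of being pinned to a fixed engine (illustrative only).
class ResourcePool:
    def __init__(self, total_units):
        self.free = total_units
        self.allocated = {}

    def acquire(self, task, units):
        if units > self.free:
            return False          # request exceeds what is currently free
        self.free -= units
        self.allocated[task] = self.allocated.get(task, 0) + units
        return True

    def release(self, task):
        self.free += self.allocated.pop(task, 0)

pool = ResourcePool(100)
assert pool.acquire("render", 30)
assert pool.acquire("inference", 60)
assert not pool.acquire("video", 20)   # only 10 units left
pool.release("render")                 # freed units return to the shared pool
assert pool.acquire("video", 20)
```

With fixed partitions, the "video" task would be stuck even while "render" capacity sat idle; a shared pool lets freed capacity serve whoever needs it next.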

Third, a unified programming interface and instruction set. Developers learn one API and one programming model to drive all hardware engines under the MUSA architecture, significantly lowering the development threshold and improving development efficiency.

In addition, the MUSA architecture incorporates several proprietary technologies developed by Moore Threads.

For instance, the "Transformer Engine" built for FP8 improves FP8 training performance by 30% over schemes without it. The proprietary ACE asynchronous communication engine lets computation and communication run in parallel, easing the traditional pain point of communication occupying compute resources and cutting compute-resource loss by 15%. And the self-developed MTLink2.0 interconnect protocol provides efficient, low-latency GPU-to-GPU communication, with 60% more bandwidth than the domestic industry average, laying a solid foundation for large-scale cluster deployment.
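The effect of an asynchronous communication engine, computation proceeding while communication is in flight, can be mimicked with two threads; this is a conceptual sketch, not the ACE hardware:

```python
import threading
import time

def overlap(compute, comm):
    """Run compute and communication concurrently; return wall-clock time."""
    start = time.perf_counter()
    t = threading.Thread(target=comm)
    t.start()      # communication proceeds in the background...
    compute()      # ...while compute runs in the foreground
    t.join()
    return time.perf_counter() - start

compute = lambda: time.sleep(0.10)   # stand-in for a GPU kernel
comm = lambda: time.sleep(0.10)      # stand-in for an all-reduce

serial_cost = 0.10 + 0.10            # what blocking communication would cost
overlapped = overlap(compute, comm)
assert overlapped < serial_cost      # overlap hides most of the comm time
```

When the two phases overlap, total time approaches the longer of the two rather than their sum, which is exactly the saving the article attributes to ACE.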

The sophistication of the MUSA architecture means each Moore Threads chip is not an isolated computing unit but a tightly coordinated, efficiently managed "combat team," raising each chip's effective compute and giving the entire AI Super Factory a solid, scalable computing base.

MUSA Full-Stack System Software: The "Operating System" and "Toolbox" of the Super Factory

Even the best hardware cannot reach its potential without efficient software. Moore Threads has therefore built a full-stack software system deeply integrated with the MUSA hardware architecture; it plays the role of "operating system" and "developer toolbox" in the AI Super Factory.

This software stack covers everything from the bottom layer drivers to the upper-layer application frameworks:

  • Efficient Drivers: Moore Threads' drivers are deeply optimized, cutting kernel launch time by 50%, with extremely low task-dispatch latency and the ability to handle thousands of concurrent tasks, an industry-leading level.
  • Core Operator Library: benchmarked against international giants' cuDNN, Moore Threads' muDNN applies extensive operator-level optimizations, reaching 98% compute utilization on GEMM operators and over 95% on Flash Attention operators.

  • Communication Performance: the MCCL training communication library achieves 97% bandwidth utilization on RDMA networks, and compute-communication parallelism via the asynchronous communication engine improves cluster performance by a further 10%.
  • Ecosystem Compatibility and Triton Support: tools such as MUSIFY provide seamless support for mainstream AI frameworks like PyTorch and TensorFlow. Notably, the Triton-MUSA compiler plus MUSA Graph delivers a 1.5x speedup for DeepSeek R1 inference while remaining fully Triton-compatible.
  • Comprehensive Developer Toolkit: Provides a complete set of tools covering performance analysis (Profiler), debugging, optimization, and one-click deployment, like a "treasure chest," allowing developers to gain insight into every detail of hardware operation and extract every bit of performance from the hardware.

This full-stack system software ensures developers can not only "use it" but "use it well," passing the power of the MUSA hardware smoothly up to applications and serving as the critical link between hardware and algorithms. Through these optimizations, Moore Threads has comprehensively improved single-node computing efficiency.

KUAE Computing Clusters: The "Production Workshop" of the Super Factory

Even with strong single-card and single-node performance, no single machine can train a model with hundreds of billions or trillions of parameters. An AI Super Factory must exist as a large-scale cluster, and for this Moore Threads built the Kua'e (KUAE) large-scale intelligent computing cluster.

The KUAE cluster is far more than a stack of servers. It is hardware-software co-designed systems engineering, the "production workshop" where large AI models are built:

  • Hardware-Software Co-Design: server nodes, switches, cabinets, and the upper-layer cluster-management and task-scheduling software are all designed and optimized together.
  • Innovative 5D Parallel Training: Moore Threads integrates all mainstream parallel strategies, including data parallelism (DP), pipeline parallelism (PP), and tensor parallelism (TP), fully supports mainstream architectures such as Transformer, and automatically searches for and recommends the optimal parallel scheme based on model characteristics.
  • End-to-End Training Optimization: Covers the entire process from data preprocessing, model pre-training, reinforcement learning, fine-tuning, to validation and evaluation, providing a one-stop service.
  • Performance Simulation Tool (Simumax): the self-developed Simumax tool automatically searches for optimal parallel strategies on ultra-large clusters and accurately simulates FP8 mixed-precision training and operator fusion, providing a scientific basis for shortening the training cycle of models like DeepSeek.
  • Efficient Checkpointing: to address stability challenges in large-model training, an innovative checkpoint acceleration scheme uses RDMA to compress backup and recovery of hundred-gigabyte checkpoints to about one second, raising the utilization of effective GPU compute.

Through the KUAE computing cluster, Moore Threads has extended its single-card advantages to the scale of thousands and tens of thousands of cards, building a truly productive AI Super Factory. In testing, the KUAE 2 large-scale intelligent computing cluster has reached industry-leading MFU (Model FLOPs Utilization) across models of different architectures.
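MFU is the ratio of FLOPs a training run actually performs to the hardware's peak FLOPs. A back-of-the-envelope calculation, with hypothetical numbers rather than measured KUAE figures:

```python
def mfu(tokens_per_s, flops_per_token, peak_flops):
    """Model FLOPs Utilization: useful FLOPs achieved over peak FLOPs."""
    return tokens_per_s * flops_per_token / peak_flops

# Rule of thumb: training costs roughly 6 FLOPs per parameter per token,
# so a 7B-parameter model needs about 6 * 7e9 FLOPs per token.
flops_per_token = 6 * 7e9

# Hypothetical throughput and a hypothetical 400 TFLOPS peak per device:
utilization = mfu(tokens_per_s=4000,
                  flops_per_token=flops_per_token,
                  peak_flops=4e14)
assert 0 < utilization < 1
```

Every efficiency loss discussed above, idle engines, blocking communication, slow checkpoints, shows up directly as a lower MFU, which is why it is the standard yardstick for cluster-level training efficiency.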

Zero-Interrupt Fault Tolerance Technology: The "Safety Production Protocol" of the Super Factory

For an AI Super Factory that must run 24/7, stability is paramount: one unexpected outage can mean millions of dollars lost and weeks of work wasted. Moore Threads has therefore developed its own "Zero-Interrupt Fault Tolerance Technology," the "safe-production protocol" that keeps the factory running.

With traditional fault-tolerance mechanisms, a hardware failure (such as a dead GPU card) means pausing the entire training job, manually replacing the hardware, and resuming from the latest checkpoint, which is slow and labor-intensive. Moore Threads' zero-interrupt approach is completely different:

  • Zero-Interrupt Fault Tolerance: when a node slows down or fails, only the affected node group is isolated; the remaining nodes continue training while a backup node joins seamlessly, with no interruption end to end. This raises the KUAE cluster's effective training time to over 99% and greatly reduces recovery costs.
  • Multi-Dimensional Training Insight: by monitoring and predicting across multiple dimensions, the system anticipates which nodes may become "slow nodes" and warns about or isolates them, achieving dynamic monitoring and intelligent diagnosis and improving fault-handling efficiency by 50%.
  • Cluster Self-Inspection and Scheduling Optimization: Before the training task starts, the system automatically performs a "health check" on the entire cluster, ensuring all software and hardware are in optimal condition, similar to the safety inspection before an airplane takes off, increasing the training success rate by 10% and providing a stable guarantee for large-scale AI training.
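The isolate-and-continue idea can be sketched as a toy cluster step: a failed node is quarantined and a hot spare takes its place, so capacity is preserved. The node names and logic below are illustrative only, not the KUAE implementation:

```python
def run_step(cluster, backups, failed=None):
    """Isolate a failed node and pull in a backup; training never stops."""
    if failed in cluster:
        cluster.remove(failed)             # quarantine the faulty node
        if backups:
            cluster.append(backups.pop())  # hot spare joins seamlessly
    return [f"{node}:step-done" for node in cluster]

cluster = ["node0", "node1", "node2", "node3"]
backups = ["spare0"]

ok = run_step(cluster, backups)                    # healthy step
assert len(ok) == 4
degraded = run_step(cluster, backups, failed="node2")
assert len(degraded) == 4                          # capacity preserved
assert "node2:step-done" not in degraded
assert "spare0:step-done" in degraded
```

The contrast with the traditional scheme is that the healthy nodes never stop: the swap happens inside a step rather than forcing a full restart from a checkpoint.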

In summary, the five elements (full-featured GPU, MUSA architecture, full-stack software, KUAE cluster, and zero-interrupt fault tolerance) together constitute Moore Threads' AI Super Factory.

It is an organic whole, interlocking from chip design at the bottom to cluster management at the top and evolving in step. It is this complete, end-to-end system that produces the performance described at the start of this article.

Then comes the next question:

Why Build an AI Super Factory?

The answer stems from Moore Threads' reading of the past, present, and future of the computing revolution.

Ten years ago, the explosive growth of "perceptual AI" represented by facial recognition and autonomous driving gave rise to the first wave of AI giants. Since ChatGPT made a breakthrough in 2022, we are currently experiencing an exponential growth phase of "generative AI."

The "intelligence" iteration speed of large models is astonishing. Last year, they were still at around 40-50 points, and now top models have surged to 70-80 points, approaching human peak levels.

Model iteration cycles have compressed from months to weeks, sometimes with weekly releases. Behind this race, the driving force is just one thing: computing power.

Just as Musk pushed the Grok model to the top of the leaderboards in short order with 200,000 H100s, the lesson is stark: the Scaling Law is the iron law of AI development.

Whoever has a larger and stronger computing infrastructure can iterate models faster and seize the technological and market high ground.

Looking ahead to the next five years, Agentic AI (agent AI) and spatial intelligence will become the next breakthrough points. AI will no longer just be a chatting tool, but a "digital employee" capable of independently completing complex tasks, and will integrate deeply with the physical world.

All of this means that the demand for computing power will grow geometrically. Under such circumstances, merely meeting current computing capabilities is far from enough; we must prepare for the more massive computing demands of the future.

Faced with endless computing demand, chasing "speed" alone is one-sided. Future computing must be comprehensively solid: stable, reliable, efficient, and versatile.

This is the fundamental reason for building an AI Super Factory.

Training a trillion-parameter large model is like building the Hong Kong-Zhuhai-Macao Bridge, a highly complex systems engineering project. Its infrastructure requirements are comparable to building a chip wafer factory.

You cannot expect "human-wave tactics" to work, asking ten million children to lift a building; likewise, you cannot simply pile up ten thousand inefficient GPUs and expect to train a high-quality large model.

The process is full of challenges. On cost, a single large-scale training run can take months and millions of dollars, and any interruption or failure is a huge loss.

Then there is system complexity: how do you make thousands of nodes and tens of thousands of chips communicate and synchronize efficiently? How do you match software and hardware perfectly? How do you locate and fix problems quickly?

And in practical applications, tasks are often diverse: today training a language model, tomorrow processing multimodal data, and the day after tomorrow conducting scientific computing...

These challenges cannot be solved by simply buying the fastest chip. They demand a complete solution spanning bottom-layer hardware, upper-layer software, and cluster management and maintenance services.

This is exactly the core value of Moore Threads' "AI Super Factory": it offers not isolated computing power but a deterministic, efficient, high-success-rate capability for producing AI models.

In short, Moore Threads has chosen the hardest path, and quite possibly the right one. Rather than chasing or overtaking on a single point, it has looked ahead and asked, from first principles, how to supply this era with its most advanced "tools of production."

This is Moore Threads' answer: one that goes beyond speed and looks to the future.

— End —


Original Article: https://www.toutiao.com/article/7531332498740085283/
