By Observer Net, Lv Dong
The underwhelming debut of GPT-5 has made many realize that the traditional scaling law has hit obvious bottlenecks. Seen from the demand side, more companies are turning their attention to the performance and experience of model inference, which bears directly on commercial deployment and monetization.
China, however, faces bottlenecks in this critical link. Its infrastructure investment is far below that of the United States, and it also contends with downgraded compute cards and HBM (high-bandwidth memory) that is both rising in price and under embargo. As AI application scenarios keep expanding, inference demand from long-text processing, multi-turn dialogue, and complex business workflows keeps growing, which makes China's AI inference difficulties even more pronounced.
Against this backdrop, Huawei has unveiled a piece of "black technology": UCM (Unified Cache Manager, an inference memory data manager), an AI inference acceleration solution. Through innovative architectural design and storage optimization, it breaks past the HBM capacity limit, lifts the inference performance of domestic large models, and fills in key links of China's AI inference ecosystem.
At a moment when NVIDIA faces a trust crisis over an alleged "backdoor," Huawei is proactively open-sourcing UCM, enabling collaboration across the framework, compute, and storage layers and pushing domestic AI inference away from "card dependence" toward a positive cycle of better experience, more users, greater enterprise investment, and faster technical iteration. This breakthrough around "memory" may prove to be a decisive battle in putting China's AI industry to practical use.
Inference has become critical, and China's bottlenecks are evident
As AI technology booms, large-model training has become a cost center, while the real value creation happens in inference.
Data shows that compute demand for AI inference has already overtaken training. In its first week, GPT-5's API was called more than 2 billion times per minute, with 70% of requests being complex reasoning tasks such as code generation and multi-step planning; Volcano Engine (ByteDance's cloud platform) now processes 16.4 trillion tokens per day, more than 70% of them from online inference rather than training.
Inference performance bears on both user experience and commercial viability, and has become a key factor in whether AI lands in production. Yet as industry adoption deepens, inference capabilities keep running into challenges; with demand for long-text processing, multi-turn dialogue, and complex business workflows on the rise, the bar for inference performance is only getting higher.
Against this backdrop, a key technique called the Key-Value Cache (KV Cache) has become central to optimizing compute efficiency and cutting redundant work. It keeps the Key (a representation of historical input features) and Value (the reference information used, together with the Key, to produce the current output) of tokens that have already been processed, so that generating each new token can reuse them directly instead of recomputing, which greatly improves inference efficiency.
The catch is that the KV Cache lives in GPU memory (typically HBM, high-bandwidth memory) to hold the historical Key/Value vectors. The longer the generated text, the larger the cache grows, to the point where HBM and even DRAM can be overwhelmed.
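To make the scale of the problem concrete, here is a minimal back-of-the-envelope sketch of how KV Cache size grows with sequence length. The model dimensions below are illustrative assumptions for a hypothetical dense transformer, not figures from this article.

```python
# Rough KV Cache sizing for a hypothetical dense transformer (illustrative numbers only).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Each token stores one Key and one Value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 70B-class model (80 layers, 8 KV heads of dim 128, FP16 cache).
for seq_len in (4_096, 32_768, 128_000):
    gb = kv_cache_bytes(80, 8, 128, seq_len, batch=32) / 1e9
    print(f"{seq_len:>7} tokens, batch 32: ~{gb:.0f} GB of KV Cache")
```

Under these assumed settings the cache grows from tens of gigabytes at 4K tokens to over a terabyte at 128K tokens, far beyond the HBM on any single accelerator.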
Chinese enterprises cannot simply match their American counterparts here. On the one hand, Chinese internet companies invest only about a tenth of what US firms spend on AI infrastructure, and small and medium-sized enterprises with limited budgets cannot afford much high-end HBM. On the other hand, export controls keep China from obtaining the most advanced compute cards and HBM, so it cannot solve the problem by stacking cards indefinitely.
More importantly, as large models handle petabyte-scale data, the bottleneck of a traditional inference architecture that leans too heavily on HBM is becoming ever more evident. With the arrival of the agentic AI era, growing model scale, surging long-sequence demand, and more concurrent inference tasks, KV Cache capacity requirements have outgrown what HBM can carry. Memory overflows become frequent, the model "forgets" mid-inference, the GPU has to recompute, and users see stutter and delay.
Under these combined pressures, domestic large models face a threefold difficulty: inference stalls, inference is slow, and inference is expensive.
Available figures show mainstream foreign large models delivering output around 200 tokens/s (roughly 5 ms per-token latency), while domestic models generally stay under 60 tokens/s (50-100 ms latency), a gap of up to 10 times. On context windows, leading overseas models (such as GPT-5 and Claude 3.5) support up to 1 million tokens, while top domestic models such as Kimi support only 500,000, and in long-text analysis the chance of a domestic model missing key information exceeds 50%.
Such an experience clearly hinders the large-scale rollout of AI in China. Left unchecked, it could tip the commercial side into a vicious cycle, further dampening enterprise investment in AI and widening the gap with foreign rivals in the international AI race.
How to markedly improve the inference experience and move AI inference into a positive commercial cycle, without a massive increase in compute infrastructure spending, has become an urgent question for China.
Huawei's "Black Technology" Breaks Through Inference Experience Bottlenecks
As noted above, in the "token economy" era, KV Cache and memory data management sit at the core of optimizing inference performance and cutting compute costs. But HBM, for all its performance, is too expensive to stack endlessly, while SSDs (solid-state drives) transfer data too slowly, which seems to form an "impossible triangle" of cost, performance, and effectiveness.
Could data instead be tiered by "hotness" across different storage media such as HBM, DRAM, and SSD, so that the model can hold more KV Cache data and recall it more intelligently and quickly? Much as a person keeps "memories" in the brain, in books, and on a computer, and retrieves whichever is needed at the moment.
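As a thought experiment, tiered "memory" of this kind could look like the sketch below, where hot entries sit in HBM, warm ones in DRAM, and cold ones on external storage. The tier capacities and the least-recently-used promotion rule are assumptions made for illustration, not UCM's actual placement policy.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy hot/warm/cold placement: HBM -> DRAM -> SSD (illustrative, not UCM's real policy)."""
    def __init__(self, hbm_slots, dram_slots):
        self.hbm = OrderedDict()   # hottest entries, smallest capacity
        self.dram = OrderedDict()  # warm entries
        self.ssd = {}              # cold entries, effectively unbounded here
        self.hbm_slots, self.dram_slots = hbm_slots, dram_slots

    def put(self, key, kv_block):
        self.hbm[key] = kv_block
        self.hbm.move_to_end(key)
        self._evict()

    def get(self, key):
        for tier in (self.hbm, self.dram, self.ssd):
            if key in tier:
                kv = tier.pop(key)
                self.put(key, kv)      # promote on access: recently used data is "hot" again
                return kv
        return None                    # miss: caller must recompute

    def _evict(self):
        while len(self.hbm) > self.hbm_slots:      # demote least-recently-used blocks downward
            k, v = self.hbm.popitem(last=False)
            self.dram[k] = v
        while len(self.dram) > self.dram_slots:
            k, v = self.dram.popitem(last=False)
            self.ssd[k] = v
```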
Huawei's newly released "black technology" UCM follows a similar approach.
UCM, short for Unified Cache Manager, is an inference acceleration suite centered on the KV Cache. It integrates multiple kinds of cache acceleration algorithms to manage the KV Cache memory data produced during inference, expand the inference context window, and deliver a high-throughput, low-latency inference experience at a lower cost per token.
Take the problem of slow inference. UCM caches previously processed results, historical conversations, corpora, and RAG knowledge-base data as KV Cache in a third tier of high-performance external shared storage. When the system encounters information that has already been inferred and cached, it does not need to infer it again; it simply looks it up in external storage, which delivers a marked speedup, cuts time to first token by up to 90%, and spares the model from grinding through the content token by token.
With this capability, a large model can remember more historical content and conversations and avoid duplicated work: content that used to take 10 seconds to generate might now take just 1 second, a marked improvement in the inference experience.
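A minimal illustration of this "look it up before you recompute" idea follows; the prefix-hashing scheme and the store interface are hypothetical stand-ins, not UCM's actual API.

```python
import hashlib

class PrefixKVCache:
    """Toy prefix reuse: key the shared store by a hash of the token prefix (hypothetical scheme)."""
    def __init__(self, shared_store):
        self.store = shared_store            # e.g. the TieredKVStore sketched above, or any external KV store

    @staticmethod
    def _prefix_key(token_ids):
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def get_or_compute(self, token_ids, prefill_fn):
        key = self._prefix_key(token_ids)
        cached = self.store.get(key)
        if cached is not None:               # hit: skip the expensive prefill entirely
            return cached
        kv = prefill_fn(token_ids)           # miss: run prefill once, then share the result
        self.store.put(key, kv)
        return kv
```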
And that is not all this "black technology" can do.
Anyone who follows large models knows that as inference tasks grow longer, long-sequence inference often leaves a model with a goldfish's "seven-second memory." When analyzing a ten-thousand-character article, for example, the limited capacity of HBM may mean the first 2,000 characters can no longer be cached, so inference fails or key related information is lost: the "inference stalls" dilemma.
How did Huawei solve this?
UCM breaks through with a series of intelligent algorithms: it slices long-sequence content and offloads processed slices to the larger DRAM or to external shared storage, effectively extending HBM's capacity, expanding the context window tenfold, and meeting the needs of long-sequence inference. In other words, the model's "memory" grows from holding three pages of paper to holding thirty.
More importantly, Huawei uses attention sparsity and related techniques to gauge the importance, relevance, and hotness of large volumes of KV Cache data and store them in tiers accordingly. In the next round of inference, only the key, relevant vectors need to be fetched, which shrinks the number of vectors involved and lifts overall throughput.
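The offloading step can be pictured with the short sketch below: a long sequence's KV blocks are split into slices, the most recent ones stay resident, and older ones are handed to a lower tier. The block granularity and handle naming are illustrative assumptions, not UCM's implementation.

```python
def offload_long_sequence(kv_blocks, hbm_budget_blocks, offload_store):
    """Keep only the most recent KV blocks resident in HBM; push older blocks to DRAM/external storage.

    kv_blocks: ordered list of per-slice KV tensors for one long sequence (illustrative layout).
    Returns the resident blocks plus handles for the offloaded ones.
    """
    split = max(len(kv_blocks) - hbm_budget_blocks, 0)
    offloaded_handles = []
    for i, block in enumerate(kv_blocks[:split]):      # oldest slices leave HBM first
        handle = f"seq-block-{i}"                      # hypothetical handle naming
        offload_store.put(handle, block)
        offloaded_handles.append(handle)
    return kv_blocks[split:], offloaded_handles
```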
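A toy version of "fetch only the key vectors" is sketched below: cached Keys are scored against the current query and only the top-k KV pairs are kept for attention. The dot-product scoring and fixed top-k are illustrative assumptions, not the specific sparsity algorithms UCM ships.

```python
import numpy as np

def select_relevant_kv(query, cached_keys, cached_values, top_k):
    """Score every cached Key against the current query and keep only the top-k KV pairs.

    query:         (head_dim,) query vector for the current step
    cached_keys:   (seq_len, head_dim) Keys accumulated in the cache
    cached_values: (seq_len, head_dim) matching Values
    """
    scores = cached_keys @ query                 # dot-product relevance, as in attention
    keep = np.argsort(scores)[-top_k:]           # indices of the most relevant entries
    keep.sort()                                  # preserve original token order
    return cached_keys[keep], cached_values[keep]

# Usage: attend over 8 cached vectors instead of 4,096, trading a little recall for throughput.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((4096, 128))
V = rng.standard_normal((4096, 128))
k_sel, v_sel = select_relevant_kv(q, K, V, top_k=8)
```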
"For KV data used for inference, there will definitely be hot, warm, and cold. It's impossible to use the most expensive medium to store all data. We have a deep understanding of storage systems. Each type of data has this characteristic and a lifecycle. We must use multiple layers of media to solve performance issues and balance cost issues," said a Huawei technical expert to Observer Net.
With the deep integration of storage and computing capabilities, through the balance of performance and cost using multiple media, the problem of "inference being expensive" is no longer a challenge. Huawei stated that without excessive investment, UCM can increase TPS (tokens per second) by 2-22 times in long-sequence scenarios, equivalent to reducing the cost per token inference, thereby alleviating the burden on enterprises and improving efficiency.
The significance of UCM is more like another "system order point" for Huawei. It is not to replace HBM, but to reduce the reliance on HBM and make full use of HBM's advantages in more suitable places.
With this technological support, enterprises can maintain their computing power investment unchanged, spending only a small portion on external storage investment, allowing caching to "upgrade" in place, improving inference efficiency, and spreading the cost per token inference, thus forming a positive cycle of "increased user traffic - enterprise revenue - further expanded AI investment - rapid technical iteration," driving the overall improvement of China's AI level.
Joint Innovation Verifies Technical Value
A technology only creates value once it actually lands. Since UCM's launch, Huawei has partnered with China UnionPay to pilot the technology in typical financial scenarios.
Why was the financial scenario chosen first?
A Huawei technical expert told Observer Net that large-model inference in the financial industry faces three core challenges. First, inference stalls: whether for research-report analysis or public-opinion analysis in production, inputs are dominated by long sequences; a research report can run to terabytes of data, and precision marketing likewise requires long-sequence context, so key information is easily lost. Second, inference is slow: concurrency cannot be raised, and once it is, per-token latency becomes very long. Finally, inference is expensive: large amounts of compute are consumed, much of it on repeatedly recalculating the KV Cache.
"The problem is long-sequence inference. Our conversations with customers run very long, and once transcribed they produce a large amount of historical dialogue and content. Using the KV Cache approach eats into GPU memory, and GPU memory becomes the bottleneck because we have so much KV Cache to hold," said a representative of China UnionPay.
Huawei and China UnionPay therefore pursued joint innovation around UCM. On one hand, they offload computed KV Cache data from GPU memory into system memory and storage, relieving pressure on GPU memory and allowing longer sequences to be handled. On the other, attention sparsity lets the model judge which entries in the KV Cache are most relevant to the current inference and fetch only those key vectors, shortening inference time and raising throughput.
It was in this joint technology pilot that UCM's technical value was fully demonstrated.
In China UnionPay's "Customer Voice" scenario, UCM technology and engineering methods lifted large-model inference speed by 125 times, so the system can pinpoint a customer's frequently asked questions in just 10 seconds. In the "Marketing Planning" scenario, a marketing plan that used to take minutes to generate is now finished within 10 seconds, and a single server can support more than five marketing staff working online at once. In the "Office Assistant" scenario, UCM handles transcription and summarization of ultra-long meeting audio exceeding 170,000 tokens with ease, escaping the "inference stalls" dilemma.
Can UCM, then, be applied to other scenarios and help AI land across industries? A Huawei technical expert answered in the affirmative.
"With the agentic AI era, the information explosion shows up on the model side as insufficient GPU memory and the cost of inference tokens. The UCM solution is aimed at this whole class of problems, not a single point; it simply landed in finance first. In the future, once AI genuinely delivers value across industries, they will all move into this territory," he told Observer Net.
Filling the Ecological Gap, Huawei Open-Sources Again
As inference performance grows in importance, the industry at large is exploring hierarchical KV Cache management. NVIDIA, for example, launched Dynamo, a distributed inference serving framework, earlier this year; it supports offloading KV Cache from GPU memory to CPU memory, SSD, or even networked storage, easing the GPU memory bottleneck for large models and avoiding repeated computation.
But with NVIDIA now embroiled in the "backdoor" controversy, the industry is calling for Chinese solutions more loudly than ever.
"Plenty of players are doing hierarchical cache management, but UCM's biggest differentiator is bringing in professional storage," a Huawei technical expert told Observer Net. Through extensive software-hardware collaboration and offloading, such as KV retrieval indexes and lifecycle management of the KV Cache, Huawei builds differentiated capability into hierarchical cache management. "Many industry solutions only carry a traditional prefix cache in their algorithm acceleration libraries. We have now commercialized a full pipeline of sparsity, suffix-retrieval, and other algorithms, so our algorithm library contributes far more, richer, more reliable, and better-performing acceleration algorithms, and it keeps growing," he added.
Followers of the large-model space may recall that Huawei previously open-sourced its AI framework MindSpore, and recently open-sourced CANN, a heterogeneous computing architecture for neural networks often compared with NVIDIA's CUDA. Now Huawei has announced that UCM, too, will be open-sourced in September.
Architecturally, Huawei's UCM solution consists mainly of three components: an inference-engine plugin (Connector) that links different engines and compute platforms, a function library (Accelerator) that supports multi-level KV Cache management and acceleration algorithms, and a high-performance KV Cache access adapter (Adapter), together enabling the inference framework, compute, and storage layers to work in concert.
Through open northbound and southbound interfaces, UCM can adapt to many kinds of inference engine frameworks, compute platforms, and storage systems. For developers of inference frameworks, UCM's open interfaces make it easier to integrate the technology and sharpen their frameworks' performance and competitiveness. For compute and storage vendors, the open-sourcing of UCM opens new opportunities to build products and solutions that pair well with it.
"We do not want to control this framework or build differentiation on top of it; we want the ecosystem to flourish and genuinely solve the AI inference problem. With today's release, Huawei's UCM contributes to the whole inference framework and to the ecosystem around inference experience and cost. Going forward, I believe more players will contribute to it; it is a process of co-creation and shared prosperity," a Huawei technical expert told Observer Net.
Taking the longer view, Huawei's release and open-sourcing of UCM is once again a win for systems engineering, one that can move China's AI industry into a virtuous cycle of better experience, more users, greater investment, and faster technical iteration. It is not meant to replace anything, but to break through the key bottlenecks in putting AI to work across industries, injecting stronger momentum into the long-term development of China's AI sector.
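Going only by the three-component description above, one could picture the layering as interfaces along the lines of the sketch below. The class and method names here are hypothetical, chosen purely to illustrate the division of labor, and are not UCM's published open-source API.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Bridges a specific inference engine / compute platform to the cache layer (hypothetical interface)."""
    @abstractmethod
    def export_kv(self, request_id): ...            # hand engine-resident KV blocks to the manager
    @abstractmethod
    def import_kv(self, request_id, blocks): ...    # load cached KV blocks back before decoding

class Accelerator(ABC):
    """Hosts the multi-level KV Cache management and acceleration algorithms (hypothetical interface)."""
    @abstractmethod
    def place(self, blocks): ...                    # decide a hot/warm/cold tier for each block
    @abstractmethod
    def sparsify(self, query, blocks): ...          # pick the subset of blocks worth attending to

class Adapter(ABC):
    """High-performance access path to DRAM / professional external storage (hypothetical interface)."""
    @abstractmethod
    def write(self, handle, block): ...
    @abstractmethod
    def read(self, handle): ...
```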
Source | Observer Net
Original: https://www.toutiao.com/article/7538370038038790671/
Statement: This article represents the personal views of the author.