An introduction to the H200's technical specifications: before deciding whether to buy it, look at the specs first

1. China's decision not to buy the H20 is clearly a wise one. This cut-down version sacrifices a great deal of performance while the price remains high, and the inference computation it is supposed to excel at is exactly the area domestic chips are actively developing. Whether to buy the H200 can be decided by examining its specifications. These specifications are not easy to parse and need to be sorted out.

2. The H200 is an upgraded version of the H100, released in March 2024, not long ago. Among the world's top 20 data-center GPU clusters, 18 use the H200 as their main compute chip. The H200 is not two H100s combined; rather, it keeps the same compute cores and upgrades the memory: 141 GB of HBM3e VRAM with 4.8 TB/s of memory bandwidth, versus the H100's 80 GB of HBM3 at 3.35 TB/s. This makes the H200's real-world performance significantly better than the H100's; in inference tasks, for example, it can reach up to twice the H100's throughput. Simply put, large-model training demands more compute, while inference demands more memory bandwidth.
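As a rough sanity check on the bandwidth-bound claim, here is a minimal sketch using the spec figures above. The proportional "throughput scales with bandwidth" model is an illustrative assumption, not a benchmark result:

```python
# Rough roofline-style comparison: if inference is memory-bandwidth-bound,
# achievable token throughput scales roughly with memory bandwidth.
# Spec figures are the ones quoted in this article.

h100_bw_tbs = 3.35   # H100 memory bandwidth, TB/s (HBM3)
h200_bw_tbs = 4.8    # H200 memory bandwidth, TB/s (HBM3e)

bandwidth_ratio = h200_bw_tbs / h100_bw_tbs
print(f"H200/H100 bandwidth ratio: {bandwidth_ratio:.2f}x")

# Bandwidth alone gives ~1.43x; the larger 141 GB VRAM also allows bigger
# batches and KV caches, which is how measured inference gains approach 2x.
```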

3. As shown in the figure, these are the indicators compiled by the U.S. IFP (Institute for Progress, which advocates banning chip exports to China). Previously, some Chinese companies bought a lot of H20s, mainly because its bandwidth figure of 4 TB/s was quite good and its performance was acceptable for large-model inference workloads. However, the H20's compute is only 296 TFLOPS, versus 1,979 TFLOPS for the H200: cut to roughly one-sixth, a severe downgrade. Domestic AI chips face relatively less difficulty covering inference workloads, so declining the H20 is also technically quite reasonable.

4. Another comparison point is the B30A, a downgraded version of NVIDIA's Blackwell chip that was previously rumored to be destined for export to China. Its bandwidth is 4 TB/s, the same as the H20, but its compute is 2,500 TFLOPS, higher than the H200's. In FP4 workloads (representing floating-point numbers with 4 bits) it reaches 7,500 TFLOPS, although such workloads are hard to exploit in practice; even FP8 is not easy to use.
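The compute and bandwidth figures quoted so far can be lined up in a short sketch. All numbers are the article's figures rather than official datasheet values, and the TFLOPS entries may mix precision classes:

```python
# Side-by-side view of the compute and bandwidth figures quoted in the
# article for H20, H200, and B30A. Treat these as the article's numbers,
# not verified official specs.

chips = {
    "H20":  {"tflops": 296,  "bw_tbs": 4.0},
    "H200": {"tflops": 1979, "bw_tbs": 4.8},
    "B30A": {"tflops": 2500, "bw_tbs": 4.0},
}

h20_vs_h200 = chips["H20"]["tflops"] / chips["H200"]["tflops"]
for name, s in chips.items():
    ratio = s["tflops"] / chips["H200"]["tflops"]
    print(f"{name}: {s['tflops']:>5} TFLOPS ({ratio:.2f}x H200), "
          f"{s['bw_tbs']} TB/s")

# 296 / 1979 comes out to about 0.15, i.e. roughly one-sixth of the H200,
# matching the "cut to one-sixth" claim in the text.
```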

5. A more comprehensive comparison of indicators is shown in Figure 2. NVIDIA's most powerful chip at present is the B300. It is sometimes claimed that the B300 is far stronger than the H200: four times the compute, or even 30 times in extreme claims. But the figure shows that the large gap mainly appears in FP4 scenarios, where the B300 has seven times the H200's compute, because the H200 has no special FP4 support. In practice, many large models do not use FP4 (training at such low precision is unlikely) and mainly use FP8. In FP8, the B300's compute is 2.5 times the H200's, but a B300 actually contains two GPUs while the H200 has one, so the real per-GPU improvement is not that dramatic. In price terms, the B300's FP8 compute-per-price value is 94, 42% higher than the H200's 66. As for bandwidth per price, all the chips are similar, so it is no longer the deciding factor.
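The FP8 arithmetic above can be checked in a few lines. The "value" scores come from Figure 2 as quoted in the article; their units and methodology are the article's, not derived here:

```python
# Checking the FP8 comparison: compute-per-price "value" scores from
# Figure 2 (article's figures, units unspecified), plus the per-die ratio.

b300_fp8_value = 94   # B300 FP8 compute-per-price score
h200_fp8_value = 66   # H200 FP8 compute-per-price score

improvement = b300_fp8_value / h200_fp8_value - 1
print(f"B300 FP8 value advantage over H200: {improvement:.0%}")

# Raw FP8 compute: B300 is quoted at 2.5x the H200, but a B300 package
# holds two GPU dies versus the H200's one, so per die the gap is smaller.
b300_per_die = 2.5 / 2
print(f"Per-die FP8 compute vs H200: {b300_per_die:.2f}x")
```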

6. Therefore, from a technical perspective, buying the H200 should let China effectively supplement its compute in FP8 scenarios. DeepSeek uses 2,048 H800s as its main compute chips. The H800's raw compute is similar to the H100's and H200's, but its memory bandwidth was cut and its NVLink bandwidth halved, limiting training performance. Even so, through extremely sophisticated algorithm and data-structure optimization, DeepSeek achieved great breakthroughs, catching up with the strongest American models at the beginning of the year. The United States is skeptical that DeepSeek used so few chips, with rumors claiming 10,000 H100s and 10,000 H800s. Either way, even by the numbers the United States believes, the training compute held by Chinese companies is seriously insufficient, while American companies operate at the scale of 100,000 H200s.

7. Another factor is energy consumption, where China and the U.S. differ greatly. In training compute, the B300 does not hold a particularly significant advantage over the H200. In inference, however, using FP4 low-precision arithmetic and architectural optimizations, its energy consumption per token is 30-50% lower than the H200's. The biggest headache for U.S. data centers is energy: not a lack of cards but a lack of electricity, so there is strong demand to replace H200s with B300s. China does not lack electricity, so the H200 suits it well. The U.S. may even have plans to sell used H200s to China, recouping some funds.

8. These specifications are complex, and the considerations behind them cannot be settled in a few sentences. Personally, I feel China should import some H200s, since training compute is needed.

Original article: toutiao.com/article/1851088218903563/

Statement: This article represents the personal views of the author.