Nvidia's next-generation Blackwell AI chips face serious overheating issues when installed in high-capacity server racks. These issues have led to design changes and delays, and have raised concerns among customers such as Google, Meta, and Microsoft about the timely deployment of Blackwell servers.
Insiders have revealed that Nvidia's Blackwell GPUs can overheat when used in server racks that hold 72 chips. Such racks are expected to draw up to 120 kW of power each. The problem has forced Nvidia to re-evaluate its server rack design multiple times, because overheating can limit GPU performance and risks damaging components. Customers are concerned that these setbacks may disrupt their schedules for deploying the new chips in data centers.
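To give a sense of scale for the figures above, the short sketch below divides the reported rack power by the reported chip count. It is only a rough illustration: the 120 kW figure covers the whole rack, which also includes CPUs, networking, and other components, so the result is an upper bound on the average per GPU rather than a measured GPU power draw. The function and constant names are hypothetical and chosen for this example.

```python
# Figures taken from the article; everything else is an illustrative assumption.
RACK_POWER_KW = 120.0   # reported upper bound on power per rack
GPUS_PER_RACK = 72      # reported number of Blackwell chips per rack


def average_power_per_gpu_kw(rack_power_kw: float, gpu_count: int) -> float:
    """Divide total rack power evenly across GPUs.

    This overstates per-GPU power, because the rack total also
    feeds CPUs, switches, and other non-GPU components.
    """
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return rack_power_kw / gpu_count


per_gpu_kw = average_power_per_gpu_kw(RACK_POWER_KW, GPUS_PER_RACK)
print(f"{per_gpu_kw:.2f} kW per GPU slot")  # prints "1.67 kW per GPU slot"
```

Even as a loose upper bound, well over a kilowatt of heat per GPU position in a single rack helps explain why cooling has required repeated rack redesigns.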
According to reports, Nvidia has instructed its suppliers to make multiple design changes to the rack to address the overheating. The company is working closely with its suppliers and partners on engineering revisions to improve server cooling. Although such adjustments are standard practice for a technology release of this scale, they still push back the expected delivery date.
According to a report by First Financial, a Nvidia spokesperson responded to questions about the delays and overheating by stating, "We are working with leading cloud service providers as an essential part of our engineering team and processes. Engineering iterations are normal and expected. Integrating GB200, the most advanced system to date, into various data center environments requires joint design with our customers." Nvidia also stated that "customers are currently seizing the market opportunity for GB200 systems."
Previously, Nvidia had to postpone Blackwell production because of a design flaw that reduced chip yields. Nvidia's Blackwell B100 and B200 GPUs use TSMC's CoWoS-L packaging technology to connect their two dies. This design includes an RDL interposer with LSI (Local Silicon Interconnect) bridges, supporting data transfer speeds of up to 10 TB/s. Precise positioning of these LSI bridges is crucial for the technology to operate as expected. However, a mismatch in thermal expansion characteristics among the GPU dies, the LSI bridges, the RDL interposer, and the motherboard substrate led to warping and system failures. To address this issue, Nvidia modified the top metal layers and bump structure of the GPU silicon to improve production reliability.
As a result, the final revised Blackwell GPUs will enter mass production only in late October, which means Nvidia will be able to ship these chips starting in late January of next year.
Nvidia's customers, including tech giants such as Google, Meta, and Microsoft, use Nvidia GPUs to train their most powerful large language models. Any delay to the Blackwell AI GPUs will naturally affect these customers' plans and products.