USA – At the recent Open Compute Project (OCP) Global Summit in San Jose, Nvidia announced that it will contribute foundational elements of its Blackwell accelerated computing platform design to the OCP.
This initiative aims to enhance collaborative development and access to advanced computing technologies within the data center community.
Nvidia’s contribution includes key design elements of its full-rack Blackwell system, the GB200 NVL72.
This encompasses the rack architecture, compute and switch tray mechanics, specifications for liquid cooling and thermal environments, and volumetric data for the NVLink cable cartridge.
A notable early deployment of the GB200 NVL72 is the Hon Hai Kaohsiung Super Computing Center that Foxconn (Hon Hai) is building in Taiwan.
The facility will feature 64 GB200 NVL72 racks totaling 4,608 Tensor Core GPUs, aimed at driving advances in cancer research, LLM development, and smart city projects, with full deployment expected by 2026.
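Those headline numbers are internally consistent: each NVL72 rack (detailed in the next section) carries 72 Blackwell GPUs, so 64 racks yield exactly the quoted GPU count. A trivial check in Python:

```python
# Consistency check: 64 NVL72 racks at 72 Blackwell GPUs each
# should account for all 4,608 Tensor Core GPUs cited above.
RACKS = 64
GPUS_PER_RACK = 72  # 36 GB200 Superchips x 2 Blackwell GPUs each

total_gpus = RACKS * GPUS_PER_RACK
assert total_gpus == 4608
print(f"{RACKS} racks x {GPUS_PER_RACK} GPUs/rack = {total_gpus} GPUs")
```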
GB200 NVL72: A powerful computing solution
The GB200 NVL72 rack is designed for training large language models (LLMs) with up to 27 trillion parameters.
It houses 36 GB200 Grace Blackwell Superchips, each pairing one Grace CPU with two Blackwell GPUs, for a total of 36 Grace CPUs and 72 Blackwell GPUs delivering 720 petaflops of training performance and 1.4 exaflops of inference capability.
The system is liquid-cooled and uses fifth-generation NVLink, which provides 1.8 TB/s of bidirectional bandwidth per GPU, allowing the entire rack to function as a single massive GPU.
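Dividing those rack-level figures across the GPUs gives a rough sense of per-GPU throughput; a back-of-the-envelope sketch, assuming (per Nvidia’s published specifications) that the quoted petaflop numbers are aggregates for the full rack:

```python
# Rough per-GPU throughput implied by the rack-level figures above.
# Assumes 720 PFLOPS (training) and 1.4 EFLOPS (inference) are
# aggregate numbers for the whole 72-GPU rack.
SUPERCHIPS = 36
GPUS_PER_SUPERCHIP = 2                         # 1 Grace CPU + 2 Blackwell GPUs per GB200
TOTAL_GPUS = SUPERCHIPS * GPUS_PER_SUPERCHIP   # 72

TRAIN_PFLOPS = 720      # rack-level training throughput
INFER_PFLOPS = 1400     # rack-level inference throughput (1.4 exaflops)

print(f"Training per GPU:  {TRAIN_PFLOPS / TOTAL_GPUS:.0f} petaflops")    # ~10
print(f"Inference per GPU: {INFER_PFLOPS / TOTAL_GPUS:.1f} petaflops")    # ~19.4
```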
The Blackwell GPU at the heart of this system is Nvidia’s latest design. Built from 208 billion transistors and manufactured on TSMC’s custom 4NP (4nm-class) process, it is positioned by Nvidia as the engine for training trillion-parameter models.
Nvidia claims the new chip delivers up to 30 times the LLM inference performance of its predecessor, the Hopper GPU. The energy efficiency of Blackwell also stands out.
For instance, Nvidia says that where training a 1.8-trillion-parameter model previously required 8,000 Hopper GPUs drawing 15 megawatts of power, 2,000 Blackwell GPUs can do the same job on just 4 megawatts.
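The arithmetic behind that comparison is worth making explicit: per-GPU power draw is roughly similar across the two generations, so nearly all of the gain comes from needing a quarter as many GPUs. A small sketch using only the figures quoted above:

```python
# Power comparison for the 1.8T-parameter training example above.
# GPU counts and megawatt figures are as quoted; the rest is arithmetic.
hopper = {"gpus": 8000, "power_mw": 15}
blackwell = {"gpus": 2000, "power_mw": 4}

for name, cfg in (("Hopper", hopper), ("Blackwell", blackwell)):
    per_gpu_kw = cfg["power_mw"] * 1000 / cfg["gpus"]
    print(f"{name}: {cfg['gpus']} GPUs at {cfg['power_mw']} MW "
          f"({per_gpu_kw:.2f} kW per GPU)")

print(f"GPU count reduction:   {hopper['gpus'] / blackwell['gpus']:.2f}x")          # 4.00x
print(f"Total power reduction: {hopper['power_mw'] / blackwell['power_mw']:.2f}x")  # 3.75x
```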
Enhancing open standards and interoperability
Nvidia has a history of contributing to OCP, having previously introduced the NVIDIA HGX H100 baseboard, which has become a standard in AI servers, and the NVIDIA ConnectX-7 adapter, which serves as the foundation for the OCP Network Interface Card (NIC) 3.0.
Nvidia also announced plans to expand NVIDIA Spectrum-X support for OCP standards. According to Nvidia CEO Jensen Huang, the collaboration aims to establish specifications and designs that can be adopted across the entire data center ecosystem, advancing accelerated computing as a whole.
By promoting open standards, Nvidia is working to maximize the potential of AI technologies for organizations worldwide.
Bridging the AI hardware gap
Nvidia’s contributions reflect a growing trend toward open hardware in the AI and high-performance computing (HPC) arenas.
By sharing parts of its Blackwell platform, Nvidia aims to make advanced computing technologies more accessible, which can improve interoperability with other open systems.
This initiative supports the expansion of the AI and HPC ecosystem, providing developers with more opportunities to leverage cutting-edge computing solutions for scientific research and large-scale applications.
As AI models become increasingly complex, particularly with the emergence of multi-trillion parameter models, the demand for scalable computing infrastructure rises.
By democratizing access to AI technology, Nvidia’s contributions can help ensure that the necessary hardware for training and deploying these models is available to a wider range of organizations, particularly in scientific fields where robust computing resources are essential.
Major developments in AI infrastructure
In a related announcement at Lenovo Tech World 2024, Jensen Huang joined Lenovo CEO Yuanqing Yang to unveil the Lenovo Hybrid AI Advantage, a comprehensive platform designed to enhance AI capabilities across enterprises.
This initiative aims to combine Lenovo’s infrastructure with Nvidia’s accelerated computing solutions to foster innovation and productivity.
The two leaders emphasized the importance of energy-efficient AI infrastructure, noting that raw performance must be balanced against power consumption and sustainability.
Lenovo’s 6th Generation Neptune Liquid Cooling solution exemplifies this commitment, significantly reducing power consumption while supporting AI workloads.