Artificial intelligence (AI) is here, and it is here to stay. “Every industry will become a technology industry,” according to NVIDIA founder and CEO, Jensen Huang. The use cases for AI are virtually limitless, from breakthroughs in medicine to high-accuracy fraud prevention. AI is already transforming our lives just as it is transforming every single industry. It is also beginning to fundamentally transform data center infrastructure.
AI workloads are driving significant changes in how we power and cool the infrastructure that supports high-performance computing (HPC). A typical IT rack used to run workloads of 5-10 kilowatts (kW), and racks running loads higher than 20 kW were considered high-density – a rare sight outside of very specific applications with narrow reach. IT is now being accelerated with GPUs to support the computing needs of AI models, and these AI chips can require about five times as much power and five times as much cooling capacity¹ in the same space as a traditional server. Mark Zuckerberg announced that Meta will spend billions to deploy 350,000 NVIDIA H100 GPUs by the end of 2024. Rack densities of 40 kW per rack are now at the lower end of what AI deployments require, and densities surpassing 100 kW per rack will soon become commonplace at large scale.
This will require extensive capacity increases across the entire power train, from the grid to the chips in each rack. Introducing liquid-cooling technologies into the data center white space, and eventually into enterprise server rooms, will be a requirement for most deployments, as traditional cooling methods will not be able to handle the heat generated by GPUs running AI calculations. Investments to upgrade the infrastructure needed to power and cool AI hardware are substantial, and navigating these new design challenges is critical.
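To put those rack densities in perspective, the short Python sketch below converts the rack power figures cited above into the heat load a cooling system must reject. The 7 kW, 40 kW, and 100 kW rack figures are drawn from the ranges mentioned above, and the unit conversions (1 kW ≈ 3,412 BTU/hr; 1 ton of refrigeration ≈ 3.517 kW) are standard; this is an illustration, not a sizing calculation for any specific design.

```python
# Back-of-the-envelope rack heat-load comparison (illustrative only).
# Rack power figures follow the ranges cited above; nearly all electrical
# power drawn by IT equipment is ultimately rejected as heat.

BTU_PER_KW = 3412    # BTU/hr of heat per kW
KW_PER_TON = 3.517   # kW of heat per ton of refrigeration

def cooling_load(rack_kw: float) -> dict:
    """Convert rack power draw into the equivalent cooling load."""
    return {
        "heat_btu_per_hr": rack_kw * BTU_PER_KW,
        "cooling_tons": rack_kw / KW_PER_TON,
    }

for label, kw in [("traditional rack", 7), ("entry AI rack", 40), ("dense AI rack", 100)]:
    load = cooling_load(kw)
    print(f"{label:>16}: {load['heat_btu_per_hr']:>9,.0f} BTU/hr "
          f"≈ {load['cooling_tons']:.1f} tons of cooling")
```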
The Transition to High-Density
The transition to accelerated computing will not happen overnight. Data center and server room designers must look for ways to make power and cooling infrastructure future-ready, with considerations for the future growth of their workloads. Getting enough power to each rack requires upgrades from the grid to the rack; in the white space specifically, this likely means high-amperage busways and high-density rack PDUs. To reject the massive amount of heat generated by hardware running AI workloads, two liquid-cooling technologies are emerging as primary options:
- Direct-to-chip liquid cooling: Cold plates sit atop the heat-generating components (usually chips such as CPUs and GPUs) to draw off heat. A pumped single-phase or two-phase fluid carries the heat from the cold plates out of the data center, exchanging heat but not fluid with the chip. This can remove about 70-75% of the heat generated by equipment in the rack, leaving 25-30% that air-cooling systems must remove.
- Rear-door heat exchangers: Passive or active heat exchangers replace the rear door of the IT rack with heat-exchanging coils through which fluid absorbs the heat produced in the rack. These systems are often combined with other cooling systems, either as a strategy to maintain room neutrality or as a transitional design that begins the journey into liquid cooling.
While direct-to-chip liquid cooling offers significantly higher-density cooling capacity than air, it is important to note that there is still excess heat that the cold plates cannot capture. This heat will be rejected into the data room unless it is contained and removed through other means, such as rear-door heat exchangers or room air cooling. For more detail on liquid-cooling solutions for data centers, check out our white paper.
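As a rough illustration of how that residual heat adds up, the sketch below applies the 70-75% capture ratio cited above to a 100 kW rack, a density discussed throughout this document. Even at the high end of capture, 25-30 kW per rack still has to be removed by rear-door heat exchangers or room air cooling.

```python
# Illustrative split of rack heat between the liquid loop and the air side
# for direct-to-chip cooling, using the 70-75% capture range cited above.

def heat_split(rack_kw: float, liquid_capture: float) -> tuple[float, float]:
    """Return (kW removed by cold plates, kW left for air or rear-door cooling)."""
    liquid_kw = rack_kw * liquid_capture
    return liquid_kw, rack_kw - liquid_kw

for capture in (0.70, 0.75):
    liquid_kw, air_kw = heat_split(100, capture)
    print(f"100 kW rack at {capture:.0%} capture: "
          f"{liquid_kw:.0f} kW to liquid, {air_kw:.0f} kW to air/rear door")
```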
AI Starter Kits for Retrofit and New Builds
Power and cooling are becoming integral parts of the IT solution design in the data room, blurring the borders between IT and facilities teams. This adds a high degree of complexity when it comes to design, deployment and operation. Partnerships and full-solution expertise rank as top requirements for smooth transitions to higher densities.
To simplify the shift to high density, Vertiv has introduced a range of optimized designs, including power and cooling technology able to support workloads of up to 100 kW per rack, in a diverse set of deployment configurations.
| Design summary | Racks | Density/rack | Green/Brown field | Heat removal from server | Heat removal from room |
|---|---|---|---|---|---|
| **Training model pilots, edge inferencing at scale** | | | | | |
| Small HPC minimal retrofit | 1 | 70 kW | Brown field | water/glycol | air |
| Small HPC retrofit for chilled water system | 1 | 100 kW | Brown field | water/glycol | water/glycol |
| **Centralized training for enterprise, AI corner in data center** | | | | | |
| Mid-size HPC cost-optimized retrofit | 3 | 100 kW | Brown field | water/glycol | refrigerant |
| Mid-size HPC with increased heat capture | 4 | 100 kW | Brown field, green field | water/glycol + air | water/glycol |
| Mid-size HPC pragmatic retrofit for air-cooled computer rooms | 5 | 40 kW | Brown field, green field | air | refrigerant |
| Mid-size HPC | 5 | 100 kW | Brown field, green field | water/glycol | water/glycol |
| **Large-scale AI factory** | | | | | |
| Large HPC preserving room neutrality | 12 | 100 kW | Brown field, green field | water/glycol + air | water/glycol |
| Large HPC building towards scale | 14 | 100 kW | Brown field, green field | water/glycol | water/glycol |
These designs offer multiple paths for system integrators, colocation providers, cloud service providers, and enterprise users to achieve the data center of the future, now. Each facility may have nuances in rack count and rack density dictated by its IT equipment selection. As such, this collection of designs provides an intuitive way to narrow down to a base design and then tailor it to the exact needs of the deployment.
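As a simple illustration of that narrowing process, the sketch below encodes the rack count, density, and site type from the table above and filters for candidate base designs. The selection criteria in the example are hypothetical placeholders for a given facility's requirements, not a Vertiv tool.

```python
# Candidate base designs, mirroring the table above:
# (design summary, racks, density per rack in kW, supported site types).
DESIGNS = [
    ("Small HPC minimal retrofit",                          1,  70, {"brown"}),
    ("Small HPC retrofit for chilled water system",         1, 100, {"brown"}),
    ("Mid-size HPC cost-optimized retrofit",                3, 100, {"brown"}),
    ("Mid-size HPC with increased heat capture",            4, 100, {"brown", "green"}),
    ("Mid-size HPC pragmatic retrofit for air-cooled computer rooms", 5, 40, {"brown", "green"}),
    ("Mid-size HPC",                                        5, 100, {"brown", "green"}),
    ("Large HPC preserving room neutrality",               12, 100, {"brown", "green"}),
    ("Large HPC building towards scale",                   14, 100, {"brown", "green"}),
]

def candidates(min_racks: int, min_density_kw: int, site: str) -> list[str]:
    """Return designs that meet the rack count, density, and site-type needs."""
    return [name for name, racks, density, sites in DESIGNS
            if racks >= min_racks and density >= min_density_kw and site in sites]

# Example: a green-field site planning at least 4 racks at 100 kW each.
print(candidates(min_racks=4, min_density_kw=100, site="green"))
```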
When retrofitting or repurposing existing environments for AI, our optimized designs help minimize disruption to existing workloads by leveraging available cooling infrastructure and heat rejection where possible. For example, we can integrate direct-to-chip liquid cooling with a rear-door heat exchanger to maintain a room-neutral cooling solution; in this case, the rear-door heat exchanger prevents excess heat from escaping into the room. For an air-cooled facility looking to add liquid-cooling equipment without any modifications to the site itself, we have liquid-to-air design options available. This same strategy can be deployed in a single rack, in a row, or at scale in a large HPC deployment. For multi-rack designs, we have also included high-amperage busways and high-density rack PDUs to distribute power to each rack.
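To illustrate why the high-amperage busways and high-density rack PDUs mentioned above become necessary, the sketch below estimates per-rack line current at these densities using the standard three-phase power formula. The 415 V feed voltage and 0.95 power factor are assumptions chosen for illustration, not figures from any specific design.

```python
# Rough per-rack line current on the distribution busway, assuming a
# 415 V three-phase feed and 0.95 power factor (illustrative assumptions).
import math

def rack_current_amps(rack_kw: float, volts: float = 415.0, power_factor: float = 0.95) -> float:
    """Line current for a balanced three-phase load: I = P / (sqrt(3) * V * PF)."""
    return rack_kw * 1000 / (math.sqrt(3) * volts * power_factor)

for kw in (40, 100):
    print(f"{kw} kW rack ≈ {rack_current_amps(kw):.0f} A per phase")
```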
These designs are compatible with a range of heat rejection options that can be paired with liquid cooling, establishing a clean and cost-effective transition path to high-density liquid cooling without disrupting other workloads in the data room.
While many facilities are not designed for high-density systems, Vertiv has extensive experience helping customers develop deployment plans to transition smoothly to high density for AI and HPC.
¹ Management estimates: comparison of power consumption and heat output at the rack level for 5 NVIDIA DGX H100 servers vs. 21 Dell PowerStore 500T and 9200T servers in a standard 42U rack, based on manufacturer spec sheets.