Liquid Cooling Primer
February 13, 2025
By Dylan Patel, Jeremie Eliahou Ontiveros, Daniel Nishball and Reyk Knuhtsen
Translator's note: As AI develops, power demand is surging, and rising density is directly and indirectly pushing forward a revolution in cooling systems. Is the move to liquid cooling driven by energy savings or by server heat-dissipation needs? This piece walks from the basic principles of liquid cooling through the liquid-cooling architectures the industry giants are practicing today, analyzes them in detail, and briefly surveys future immersion cooling and two-phase cold-plate technologies; reading it should leave you with a clearer picture.
Let's now discuss Nvidia's roadmap, the near-term and long-term future of datacenter design, and the impact on equipment suppliers. We believe that the real drivers behind liquid cooling adoption are still misunderstood, and so is the future of cooling systems for inference vs training datacenters. We've often heard that liquid cooling adoption is driven by superior energy efficiency, or because cooling >1000W chips with air is not possible. We also commonly hear that inference will require low-power servers and air cooling.
AI Datacenters, the Rise of Liquid Cooling, and GenAI System Roadmap

The real driver behind the large-scale adoption of liquid cooling is the total cost of ownership (TCO) for GenAI compute. While some argue that liquid cooling is expensive, the total cost over the lifetime of a cooling system (~15 years) is remarkably low compared to the total lifetime cost of the IT equipment over a similar timeframe. Getting the most out of the IT equipment is what really matters – and this is what liquid cooling is all about: enabling GPUs and AI accelerators to be physically much closer to one another, allowing a greater number of accelerators to work together as a team on calculations.
Nvidia's GB200 NVL72 offers the best TCO for LLM inference, with up to a 10x increase in performance vs a Hopper system. This is in large part due to the extension of the scale-up network NVLink from 8 GPUs to 72 GPUs – only achievable through increased rack density and thousands of copper links inside each rack. Nvidia's roadmap is clear and will progressively push rack density towards 1MW, with 500+ GPUs connected via NVLink. Custom AI accelerators like Trainium and TPU share a similar roadmap – increasing density to enable higher performance through faster scale-up and scale-out networks. The below slide from Vertiv provides a simplified overview of Nvidia's roadmap.
[Image]
And key customers share the same ambition. At the OCP Global Summit 2024, Google discussed plans to develop IT racks of up to 1MW. To enable this, power shelves will move off the rack and sidecar "power racks" will be introduced. Power distribution voltage will rise from 48Vdc to ±400Vdc. Nvidia has similar plans with Rubin, but staying at 48V.
[Image]
Meta and Microsoft recently formed a partnership to develop a similar solution – the Mt. Diablo project. The key concept is "disaggregated power", with power shelves moving off the rack into a dedicated sidecar.
[Image]
Microsoft and Meta are also very open about their 1MW ambition, as shown below. Like Google, a key enabler is the rising distribution voltage – the 48Vdc distribution busbar will be replaced by ±400Vdc. In addition to improved efficiency, this is key to increasing density: a 48Vdc copper busbar powering a 500kW rack would require a 56mm diameter and weigh 47 kg / 103 lbs, while a ±400Vdc busbar (i.e. an effective voltage of 800V) would only need a 14mm diameter and weigh 3 kg / 6.5 lbs!
[Image]
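To sanity-check those busbar figures, here is a rough back-of-the-envelope sketch in Python. The ~4.3 A/mm² current density and the ~2 m busbar run are our own illustrative assumptions, not values from the slide; the point is simply that conductor cross-section and copper mass scale with the current I = P/V.

```python
# Rough sanity check on the busbar comparison above.
# Assumptions (illustrative, not from the source): ~4.3 A/mm^2 allowable
# current density and a ~2 m in-rack busbar run.
import math

COPPER_DENSITY = 8960      # kg/m^3
CURRENT_DENSITY = 4.3e6    # A/m^2, assumed allowable loading
RUN_LENGTH = 2.0           # m, assumed busbar length

def busbar(power_w: float, voltage_v: float) -> tuple[float, float]:
    """Return (equivalent diameter in mm, copper mass in kg) for a solid bar."""
    current = power_w / voltage_v                  # I = P / V
    area = current / CURRENT_DENSITY               # m^2 of copper needed at assumed J
    diameter_mm = 2 * math.sqrt(area / math.pi) * 1000
    mass_kg = area * RUN_LENGTH * COPPER_DENSITY
    return diameter_mm, mass_kg

for label, volts in [("48 Vdc", 48), ("+/-400 Vdc (800 V effective)", 800)]:
    d, m = busbar(500_000, volts)
    print(f"{label}: I = {500_000 / volts:,.0f} A, ~{d:.0f} mm dia, ~{m:.0f} kg")
# -> roughly 55 mm / ~43 kg vs 14 mm / ~2.6 kg, the same order as the quoted figures.
```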
Let's now take a brief step back and discuss what liquid cooling actually is and how it compares to air cooling. This report will largely focus on Direct-to-Chip Liquid Cooling (DLC) – we will briefly touch on immersion at the end, but a future article will dig much deeper into this subject.
First, when we talk about liquid cooling, we are specifically referring to liquid cooling of the chips or servers, not the liquid feeding air handlers and fans. As discussed earlier, many datacenters already use facility water loops (liquid) to remove heat from data halls, but few use liquid to remove heat from chips or servers. Liquid cooling has been implemented for niche use cases for decades across various industries, including datacenters built in the 60s to cool IBM mainframes! In modern datacenters, however, air cooling is largely preferred, because:
It is generally cheaper upfront, simpler, and supply chains for air cooling are well established.
As facilities have grown larger, air-cooling technology has also improved and kept up with rising power densities per rack while maintaining energy efficiency.
Therefore, while many participants agree that liquid cooling allows OpEx savings via lower energy consumption (~10%), typically through the reduction or removal of server fan power, this was not a strong enough incentive to move to liquid when weighed against the higher capex of liquid cooling, the increased complexity and operational risk, and less well understood supply chains. It is also well known that liquid cooling enables higher density, which saves space – but physical space is a small cost item for datacenters, as most costs scale with critical IT power.
The reason liquid is more efficient and can allow higher densities is that it can absorb ~4,000x more energy per unit of volume than air. On the energy efficiency side, this is partly offset by the need for pumps and sophisticated plumbing: water is ~830x denser than air, making it harder to move.
Note that liquid flow rate and pumping energy have a linear relationship – unlike fan power and airflow. However, increasing rack density through DLC has its own set of challenges – notably around piping and plumbing. Extra-high density potentially requires very large pipes and expensive materials.
[Image]
Today, air cooling is still the dominant technology in the AI world. Nvidia's reference design for H100 datacenter deployments provides for at most four air-cooled servers per rack, totaling 41kW. The reference design explains that in most air-cooled datacenters, cooling oversubscription often limits operators to installing 8 racks of GPU servers per row (plus two more for storage and networking equipment), meaning that 8 racks are left empty in this design! However, it is possible to increase density further – technologies such as Rear Door Heat Exchangers (RDHx) or in-cabinet air containment like that of DDC Cabinet Technology enable densities above 50kW per rack.
[Image]
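For context, a quick bit of arithmetic on the reference design quoted above (only the 41 kW per rack, four servers per rack and eight GPU racks per row figures come from the text; the rest is simple division):

```python
# Quick arithmetic on the H100 air-cooled reference design described above (a sketch).
servers_per_rack = 4
rack_power_kw = 41
gpu_racks_per_row = 8

per_server_kw = rack_power_kw / servers_per_rack        # power of one HGX server
row_gpu_load_kw = gpu_racks_per_row * rack_power_kw     # GPU server load per row

print(f"~{per_server_kw:.2f} kW per HGX server")        # ~10.25 kW
print(f"~{row_gpu_load_kw} kW of GPU servers per row")  # ~328 kW
```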
One of the key limits on density is at the server level. High-TDP chips require a much larger heat sink – which is why Nvidia HGX servers with eight GPUs tend to be very large (8RU), and this TDP and heat dissipation requirement will increase in future generations such as the air-cooled Blackwell SKUs.
In contrast, introducing liquid inside the server enables much more compact designs at a similar power draw. Nvidia opted for direct-to-chip single-phase technology for most Blackwell SKUs: this design uses copper cold plates placed directly on top of the hottest chips (GPUs and CPUs). Fans are still required to remove the residual heat from components that are not liquid cooled, such as the NICs, storage and transceivers – up to 15% of the total rack heat dissipation.
[Image]
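As a rough illustration of that split, the sketch below assumes a ~130 kW GB200 NVL72-class rack (our assumption; actual rack power varies by configuration) and applies the ~85/15 liquid-vs-air ratio discussed in this report:

```python
# Illustrative split of rack heat between cold plates and air.
# The ~130 kW rack load is an assumption; the ~85/15 ratio is the guidance above.
rack_kw = 130
dlc_share = 0.85

liquid_kw = rack_kw * dlc_share
air_kw = rack_kw - liquid_kw
print(f"~{liquid_kw:.0f} kW removed by cold plates, ~{air_kw:.0f} kW still via air")
# -> roughly 110 kW to liquid and 20 kW to air per rack.
```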
These plates are supplied with cold water and return warm water. That water loop flows through a manifold inside the rack.
[Image]
The loop is generally handled by a Coolant Distribution Unit (CDU), as shown below.
[Image]
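To give a sense of CDU sizing, here is a hedged back-of-the-envelope calculation of the secondary-loop flow a CDU must deliver. The 10 K temperature rise mirrors the 37°C supply / 47°C return conditions cited later in this report; the fluid properties are approximate and real loops typically run a water/glycol mix.

```python
# Back-of-the-envelope CDU loop sizing (a sketch with assumed conditions:
# 10 K coolant temperature rise, approximate warm-water properties).
CP_WATER = 4180.0    # J/(kg*K)
RHO_WATER = 990.0    # kg/m^3 at ~45 C

def coolant_flow(heat_w: float, delta_t_k: float = 10.0) -> float:
    """Required secondary-loop flow in L/s for a given heat load."""
    mass_flow = heat_w / (CP_WATER * delta_t_k)    # kg/s from Q = m_dot * cp * dT
    return mass_flow / RHO_WATER * 1000            # convert to L/s

for label, load in [("1 MW in-row CDU", 1_000_000), ("80 kW in-rack CDU", 80_000)]:
    lps = coolant_flow(load)
    print(f"{label}: ~{lps:.1f} L/s (~{lps * 3.6:.0f} m^3/h)")
# -> ~24 L/s (~87 m^3/h) for the 1 MW unit, ~1.9 L/s for the 80 kW unit.
```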
CDUs can either be large, centralized in-row units (1MW+ capacity) or smaller in-rack units (~100kW in 4U). For large deployments, in-row is generally the preferred option as it is cheaper and easier to maintain (fewer components). However, time-to-market is crucial, supply chains are relatively new, and end customers would rather "point the finger" at a single vendor if something goes wrong than hold lengthy blamestorming sessions trying to identify where the fault lies. As such, integrated in-rack solutions from OEMs like SuperMicro are getting some traction for large-scale deployments. In both cases, the CDUs are located inside the IT room.
[Image]
The pictures below show real-life CDUs: one is an in-row CDU from Rittal (1MW of cooling capacity), and the other is an in-rack CDU at the bottom of a cabinet, with a cooling capacity of 80kW.
[Image]
[Image]
The Future of Datacenter Cooling Systems

At the facility level, the next two years will see three main types of deployments. The first is the use of Liquid-to-Air (L2A) heat exchangers. This is both the least efficient and the most expensive option, owing to the cost and inherent inefficiency of L2A systems. It can be considered a "bridge" driven by the mismatch between GB200 NVL36/72 demand and the datacenter capacity suitable for DLC.
[Image]
Many datacenters optimized for air cooling (like the Microsoft and AWS designs we previously covered) are still under construction but are being tasked with deploying GB200 – the only solution here is to use L2A. An L2A heat exchanger doesn't require facility-level plumbing. A closed liquid cooling loop runs between the L2A sidecar and the IT rack's DLC system, with hot return liquid pumped into the L2A sidecar, which cools the liquid and returns it to the IT rack. The L2A unit then uses a radiator and powerful fans to transfer heat from this liquid into the hot aisle as hot air. Standard facility-level systems such as free air cooling can then remove the heat from the data hall.
[Image]
According to Nvidia, the TCO for such an L2A system is significantly higher, even compared to a standard air-cooling system. We don't know the underlying assumptions behind this calculation, but we directionally agree with the result.
[Image]
The second option being adopted is a "hybrid" cooling system, where a central chilled water plant removes heat from both air and liquid. The CDU (liquid-to-liquid, or L2L, whether in-rack or in-row) and the CRAH/fan wall exchange heat with a common chilled water piping system – as shown in the below design from Vertiv. As we explained in our GB200 Hardware Architecture report, only the GPUs and NVLink switches are DLC cooled, while the NICs, CPUs, and much other assorted IT equipment are still air cooled. Nvidia's GB200 design asks operators to provide ~85% of cooling via DLC, with the remaining ~15% still via air cooling.
[Image]
The below example from Vertiv shows what such a layout would look like.
[Image]
We expect most datacenters to go with this hybrid deployment in 2025 and 2026. Many are still unsure of the exact liquid vs air cooling mix, and hybrid systems provide the flexibility to deal with slight changes in this ratio. Many datacenter operators are currently advertising their ability to deploy and retrofit DLC. This is because, as previously explained, most colocation datacenters already have a central chilled water system.
In theory, the retrofit work is easy – build a new piping system inside the data hall and have the CDUs "tap off" the existing central water system. In practice, we believe many operators are struggling due to a lack of standardization, and although the retrofit work can be done, it is relatively expensive. Datacenters have different central water piping systems (flow rate, pressure, diameter, etc.) and fluid mixes (water is used for facility cooling, while a water and glycol mix is typically used for DLC inner loops, though it varies). Merchant CDUs available on the market may not match the desired specifications, and most merchant CDUs are built differently from one another – again due to a lack of standards. Therefore, most operators must build datacenters with new designs to accommodate L2L liquid cooling. Another problem is that using one shared facility water loop means the data hall cooling equipment (CDUs and air handlers) is forced to work with the same facility water, at a common temperature that might be optimal for neither.
The second issue with a hybrid system is energy efficiency. The diagram below comes from OVHCloud, a French cloud service provider that has been using DLC for CPU cooling for over two decades! While its density is much lower than Nvidia's >130kW, we can still see the issue.
The problem with hybrid cooling is the need to accommodate systems with different heat transfer performance. A liquid-to-liquid heat exchanger (like a plate heat exchanger) has the best performance, while air-to-liquid (A2L) exchangers are significantly less effective. In this case, OVH uses both DLC and RDHx (i.e. in-rack A2L) with the same facility water system. Due to the inferior heat transfer performance, the RDHx (or alternatively a central fan wall or CRAH) must be operated at significantly lower inlet temperatures – in this case 30°C, compared to 45°C for the DLC system. With both systems sharing a common central water loop, OVH is forced to cool the facility water down to 27°C. Their PUE is still enviable at 1.26, but this is helped by their geographic mix being heavily tilted towards the north of France.
[Image]
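A minimal sketch of why the shared loop ends up at 27°C: the facility water has to satisfy the most demanding (coldest) consumer, minus a heat-exchanger approach. The 3 K approach below is our assumption; the 30°C and 45°C inlet requirements are the OVH figures quoted above.

```python
# Why a shared facility loop gets dragged down to its coldest consumer (a sketch;
# the 3 K heat-exchanger approach is an assumed value, inlet temps are from the OVH case).
required_inlet = {"RDHx / CRAH (air side)": 30.0, "DLC cold plates": 45.0}  # deg C
APPROACH_K = 3.0   # assumed exchanger approach temperature

shared_supply = min(required_inlet.values()) - APPROACH_K
print(f"Shared facility water supply must be <= ~{shared_supply:.0f} C")   # ~27 C

# A dedicated DLC loop would only need to satisfy the cold plates
# (an upper bound; actual design points, like Schneider's 37 C, sit below it):
dedicated_supply = required_inlet["DLC cold plates"] - APPROACH_K
print(f"A dedicated DLC loop could run at up to ~{dedicated_supply:.0f} C")
```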
Dedicated Cooling Systems

[Image]
In this configuration, the dedicated system for DLC can operate at significantly higher temperatures. Schneider's reference design points to a 37°C inlet temperature and a 47°C outlet temperature. Under these conditions, the DLC cooling loop can most likely operate all year long using dry coolers without adiabatic assist – energy and water use are minimized. The impact of lower PUE can be significant – in the AI era, there is an ongoing power shortage and every MW counts. A datacenter operator securing 200MW from the power grid would have 174MW available for IT equipment at a peak PUE of 1.15, versus 154MW at a peak PUE of 1.3.
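The arithmetic behind that comparison is simply PUE = total facility power / IT power, so IT power = grid allocation / PUE:

```python
# The arithmetic behind the 174 MW vs 154 MW comparison above (a sketch;
# assumes the full grid allocation is consumed at the stated peak PUE).
def it_power_mw(grid_mw: float, pue: float) -> float:
    """Power left for IT equipment, since PUE = total facility power / IT power."""
    return grid_mw / pue

for pue in (1.15, 1.30):
    print(f"200 MW grid @ peak PUE {pue}: ~{it_power_mw(200, pue):.0f} MW for IT")
# -> ~174 MW vs ~154 MW, i.e. roughly 20 MW of extra accelerators from better PUE.
```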
Dedicated cooling systems typically retain some degree of flexibility through oversizing. For example, a 100MW datacenter could have 85MW of cooling capacity for DLC and 25MW for air. Nevertheless, the risk for an operator building such a facility is that GenAI and liquid cooling adoption fades.
However, we built a comprehensive Liquid Cooling model and believe this scenario is extremely unlikely. Contact us for full details of our model and TCO on different cooling systems.
Hyperscaler Designs and Roadmap – The End of CDUs?

We previously explained that Microsoft, Meta and AWS datacenters are typically optimized for air cooling. L2A is therefore the main option for rapidly deploying GB200 NVL36/72. But looking beyond 2025's deployments, hyperscalers have reacted and launched new designs. The most notable shift is Meta's, with an "H" design under construction scrapped in favor of a new AI-ready design – much faster to build, denser, and with a dedicated water system.
[Image]
The new design uses air-cooled chillers and a "hybrid" system handling both air and water. It will maintain best-in-class water efficiency, but we expect PUE to go up – we aren't sure by exactly how much, but likely to above 1.10 from the current ~1.08 levels.
[Image]
Microsoft initially employed a different strategy. The company kept developing its "Ballard" datacenters but introduced a new ultra-dense design in a very small number of select locations.
[Image]
The new design uses a combination of water-cooled chillers and dry coolers (with adiabatic assist) – the latter with a "vertical" design resembling that of open-loop cooling towers, improving space efficiency compared to typical dry coolers.
[Image]
More recently, the tech giant introduced a variation of its Ballard design, with a first iteration in Phoenix, Arizona. This is a "hybrid" system with CDUs and fan walls sharing a centralized facility water system and relying on traditional air-cooled chillers.
[Image]
AWS has also announced a shift towards DLC, but we have yet to see any new datacenter design from the company with sufficient facility-level fluid coolers such as dry coolers or cooling towers. The company's next-generation flagship facilities remain optimized for air cooling. Therefore, we believe that most of the company's GB200 deployments in 2025 will use L2A systems.
[Image]
Lastly, let's discuss Google. The company has deployed liquid-cooled chips for over a decade. Its current datacenter design is a hybrid system accommodating both air and liquid, but the company disclosed an interesting roadmap at OCP 2024. With rack density moving towards 1MW, in-rack CDUs won't be an option and even in-row CDUs won't be enough: such units typically have a capacity of 1MW, up to 2MW, so each rack would require a dedicated sidecar CDU.
Google is evaluating removing CDUs entirely and delivering facility chilled water directly into the rack. We discussed earlier the lack of standardization in the industry, and we believe a "chilled water direct to rack" solution is unlikely to emerge in the broader ecosystem. However, vertically integrated companies like Google, which control the entire system from chip (TPU) to datacenter, can design a system from the ground up to match those requirements.
[Image]
Equipment Supplier Landscape

In the final part of this report, we discuss the supplier landscape. We have built a supplier tracker covering 150+ equipment and service vendors, with datacenter revenue exposure and a breakdown by product. This results in a bottom-up market share estimate by product – contact us for more information.
There are two distinct categories in the datacenter cooling space: traditional facility-level equipment, and liquid-cooling-specific systems.
In the first category, market dynamics resemble those of electrical systems, but in a generally more fragmented and competitive environment. There are a number of very large companies in this category, including datacenter equipment giants like Vertiv and Schneider Electric, global HVAC leaders such as Carrier, Johnson Controls and Trane, and a few large datacenter and telecom HVAC pure plays like Stulz. This market structure leads to lower gross margins compared to electrical equipment suppliers, but operating margins are healthy nonetheless and everyone will benefit from the market dynamics. The datacenter CapEx boom remains underestimated.
Within that market, some firms specialize in specific categories, and there are winners and losers. For example, companies overly exposed to airside economizer systems are likely to lose out in relative terms, as hyperscale self-build designs move from free air cooling to facility water circuits.
Another product category likely to lose out in relative terms is open-loop cooling towers, as the surge in datacenter scale will limit the number of sites with enough available water. One US-listed supplier in particular would suffer significantly if Google were to shift to dry-type systems.
The other key space is liquid cooling equipment – triggering a lot of attention and debate, with many new entrants, from Taiwan in particular, trying to take on established firms like Vertiv, and giants left wrong-footed in terms of technology and equipment portfolio relying on M&A to catch up (Schneider Electric/Motivair). While L2A solutions should be viewed as a "bridge", they are still likely to be adopted at scale in 2025, which will drive massive revenue upside for the largest vendors – due to a very high price per MW. For L2L systems, we believe that several firms are facing reliability issues, but one US-listed company is doing particularly well and is positioned to take significant market share.
Is DLC a Temporary Solution to be Replaced by Immersion?

Lastly, we want to touch briefly on future cooling technologies. While single-phase DLC is set to ramp up massively today, there is an ongoing R&D effort to develop two-phase solutions and immersion cooling. We often hear the narrative that immersion is the ultimate solution, or that two-phase is much more efficient than single-phase DLC.
We will explore those technologies in detail in a future report. As a short teaser, let's just say that we believe the market is misunderstanding the physics behind those systems. Immersion is often depicted as the next solution to further increase rack density, but we disagree. Forced water convection (i.e. DLC) has superior heat transfer performance compared to free convection in single-phase oil (i.e. immersion). While there is an ongoing effort to improve immersion heat sinks, their peak density remains inferior to that of DLC.
Two-phase DLC is also being explored, and while there are caveats and short-term issues, the technology is more promising. An Nvidia presentation by its Director of Data Center Cooling and Infrastructure shows two-phase becoming a key technology for scaling their systems' rack density.
[Image]
Oracle & ByteDance Market Rumblings at PTC 2025

We sent two of our team members to the Pacific Telecommunications Conference 2025 in Honolulu, Hawaii this past week. There were some heavy hitters in attendance, including Microsoft, Oracle, DayOne (FKA GDS), CtrlS, AirTrunk, Ciena, STT, and ByteDance. It was not an event to miss.
While there, we heard plenty about upcoming developments in cooling, alongside conversations about future GPU deployments and geographic risks. In Johor, we confirmed that Oracle should see its liquid-cooled clusters deployed in the first half of 2025, so long as its NVEU exception is granted. Overall, only a handful of operators have liquid-cooled deployments, and many remain cautious. We still believe that Malaysia's 130kW rack builds remain among the most aggressive buildouts globally. Many operators pointed out that direct-to-chip liquid cooling is very costly and difficult to implement, especially when the inlet temperature for DtC needs to be cooled with chillers, adding cost and worsening PUE. Additionally, many colocation providers confirmed to us that they prefer working with established equipment vendors (Schneider Electric and Vertiv were specifically named) rather than introducing upstart cooling component vendors.
Translated for 知社 by:
陆月
Researcher, DKI (DeepKnowledge Intelligence)