
Underlying Logic and Value Reconstruction: The Evolution of Liquid Cooling in AIDCs from Coolant Distributor to Intelligent Core

2026.04.16

A renowned AI research institute was training multiple leading foundation models in its AIDC. During a critical training phase, the liquid cooling system failed, and coolant leaked across multiple critical servers. The servers immediately overheated, triggering alarms and emergency shutdown on some devices. Despite the staff's emergency measures, some chips sustained irreversible damage from overheating. The incident halted a critical model training project and wasted significant prior investment in computation and time.

This is neither an isolated incident nor an exaggeration. With the boom of AI and big data, demand for data centers and computing power has grown exponentially, driving a substantial increase in device power density. In this context, traditional air cooling is no longer sufficient, and liquid cooling technology has emerged as an industry solution. At the core of the liquid cooling system, the coolant distribution unit (CDU) is at a pivotal point in its transition from a "passive distributor" to an "active controller." The evolution of the CDU from a basic "cooling system component" into an "intelligent core" with management and control capabilities is critical for the stability, efficiency, and safety of the entire system.

Challenges: The Urgent Need for Liquid Cooling System Evolution in the Era of Intelligent Computing

The computing power race is intensifying. User demand for greater computing power is driving chip thermal design power (TDP) sharply higher. Traditional air cooling can no longer meet the cooling requirements of high-power-density scenarios, so liquid cooling is seeing rapid large-scale adoption.

As liquid cooling technology accelerates its penetration, the market has grown rapidly in recent years. The global data center liquid cooling market reached US$870 million in 2024 and is expected to grow at a compound annual growth rate (CAGR) of 51.93% from 2024 to 2030, reaching US$10.7 billion by 2030, according to Arizton, a market research firm. Among liquid cooling technologies, the cold plate liquid cooling system is a widely used and mature solution. However, as intelligent computing demands greater safety, efficiency, and cost-effectiveness, cold plate liquid cooling systems struggle to meet the requirements of large-scale operations because of weaknesses in four areas.

• Safety: Emergencies such as coolant leakage and coolant supply interruption may directly cause device breakdown, jeopardizing the continuity of intelligent computing services.

• O&M: The traditional mode of manual inspection and shutdown maintenance is inefficient. It also increases O&M costs and the risk of service interruptions.

• Energy efficiency: The heat-exchange efficiency of plate heat exchangers is low, and the device operation is inconsistent. Therefore, cooling efficiency is limited, and energy consumption remains high.

• Deployment: CDU cleaning and other preparations alone take 7 to 15 days. This significantly slows down the construction and capacity expansion of AIDCs.

These core pain points hinder the large-scale deployment and efficient operation of AIDCs.

Key to Breakthroughs: CDU Transformation

A liquid cooling system consists of three core components.

• The primary-side system: It includes the cooling tower, hydronic module, and chilled water pipes. It is technologically mature and highly standardized, but offers limited optimization potential.

• The secondary-side system: It includes the coolant pipe and cold plate. The cold plate is closely coupled to the server chip layout. It is highly customized, making standardization difficult.


• CDU: It acts as the hub connecting the primary-side and secondary-side systems. As the core of the entire system, it handles heat exchange and flow distribution. The CDU has risen to prominence only with the large-scale rollout of liquid cooling, so it has had a relatively short evolution period and is still in a stage of rapid technological iteration. Although it holds potential for standardization, there is significant room for improvement in key technical dimensions, including product architecture integration, depth of intelligent algorithms (such as multi-dimensional collaborative optimization of energy efficiency), and multi-scenario adaptability and flexibility (such as precise temperature control under extremely high-density computing power).

Traditional CDUs serve only as the distribution channel of a liquid cooling system. They passively distribute coolant and cannot make independent decisions or optimize operation. Huawei Thermal Management Unit (TMU) is not just an iteration of traditional CDUs, but an intelligent core that delivers safety, intelligent O&M, energy-efficiency optimization, and fast deployment. It proactively controls the entire liquid cooling process, marking a fundamental shift from passive response to proactive prediction. Where a traditional CDU is a tool, Huawei TMU is an intelligent hub: through differences in these four areas, it reshapes the liquid cooling system and has emerged as the next-generation core device for the liquid-cooling upgrade of AIDCs.

Huawei TMU: Advancing CDU Evolution Toward an Intelligent Core

Based on Huawei's liquid cooling projects across multiple large-scale AIDCs, the following illustrates how Huawei TMU, unlike traditional CDUs in the industry, functions as an intelligent core across safety, O&M, energy efficiency, and deployment.

1. Safety core: from single-point protection to system-wide protection

Because they rely on basic hardware protection, traditional CDUs are prone to system breakdown caused by single points of failure (SPOFs). They are slow to recover from faults, and their coolant leakage and overpressure risks are unpredictable. Furthermore, they lack network security protection. By contrast, Huawei TMU features a comprehensive safety system with four layers of protection.

• First layer: Dual AC/DC hot standby ensures seamless switching.

• Second layer: A 2N redundancy design for core components eliminates SPOFs.

• Third layer: Real-time, full-link detection (pressure, coolant quality, and conductivity) and quick emergency response (20-second fast restart, 5-minute emergency coolant refill, and pumps powered directly by the mains supply to sustain cooling) ensure immediate fault recovery.

• Fourth layer: Public security product certification demonstrates the product's ability to prevent hacker attacks, fault propagation, and service interruptions, enabling it to achieve far higher security and reliability than traditional CDUs.
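The third layer's real-time detection can be pictured as a threshold check over loop sensor readings. The following is a minimal sketch of that idea only, not Huawei's implementation: the `LoopReading` type, the `evaluate` function, and every limit value are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoopReading:
    """One sample from the secondary loop (hypothetical sensor set)."""
    pressure_kpa: float
    conductivity_us_cm: float


# Hypothetical limits for illustration; real limits depend on the equipment.
MIN_PRESSURE_KPA = 100.0
MAX_PRESSURE_KPA = 600.0
MAX_CONDUCTIVITY_US_CM = 10.0


def evaluate(reading: LoopReading) -> list[str]:
    """Return the alarm names raised by a single reading."""
    alarms = []
    if reading.pressure_kpa > MAX_PRESSURE_KPA:
        alarms.append("overpressure")
    elif reading.pressure_kpa < MIN_PRESSURE_KPA:
        # Low pressure may indicate a leak or a coolant supply interruption.
        alarms.append("low_pressure")
    if reading.conductivity_us_cm > MAX_CONDUCTIVITY_US_CM:
        # Rising conductivity is one indicator of degraded coolant quality.
        alarms.append("conductivity_high")
    return alarms


print(evaluate(LoopReading(pressure_kpa=650.0, conductivity_us_cm=12.0)))
```

In a real controller, alarms like these would feed the emergency responses listed in the third layer, such as fast restart or emergency coolant refill.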

2. O&M core: from scheduled shutdown maintenance to intelligent and simplified in-service maintenance

Traditional CDU maintenance is challenging: it requires shutdown, depends on frequent manual inspection and on dedicated tools for coolant refilling, involves complex and time-consuming component maintenance, needs large room space, and incurs high labor costs. Huawei TMU reshapes O&M through its modular, intelligent features. Core modules are hot-swappable and can therefore be replaced without system shutdown. Flexible lifting casters and a front-and-rear access design enable efficient deployment and maintenance. The TMU also supports one-click self-diagnosis and intelligent coolant refill, eliminating the need for manual effort and dedicated tools. By upgrading O&M from scheduled maintenance requiring system shutdown to on-demand, in-service maintenance, Huawei TMU improves O&M efficiency by over 50% and addresses the pain points of traditional CDUs.

3. Energy-efficiency core: from passive energy saving to full-link optimization

Traditional CDUs have low heat-exchange efficiency (approach temperature of 4°C to 8°C) and a fixed operating mode that cannot adapt to IT load fluctuations. Consequently, the energy consumption is high, and the power usage effectiveness (PUE) cannot reach the target of 1.1. In comparison, Huawei TMU achieves an energy efficiency leap through three core technologies. It uses a 304 stainless steel plate heat exchanger that reduces the approach temperature to 3°C, decreasing the energy consumption of the primary-side cooling source by 15%. With adaptive load adjustment, Huawei TMU enables the pump to stay efficient. Using AI-based collaborative optimization (flow precisely controlled within 5%, idle-unit hibernation and wakeup, and historical-data algorithm iteration), it maximizes system-wide energy efficiency. Huawei TMU has an annual PUE of less than or equal to 1.12 in real-world projects, whereas traditional CDUs struggle to break the 1.4 PUE bottleneck.
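Two of the quantities above can be made concrete with basic heat-transfer arithmetic. The sketch below assumes water as the coolant (specific heat of about 4186 J/(kg·K)), reads "within 5%" as a deviation of at most ±5% from the target flow, and uses hypothetical example loads and temperatures. The function names are illustrative and not part of any Huawei interface.

```python
WATER_CP_J_PER_KG_K = 4186.0  # assumed coolant: water; other coolants differ


def required_flow_kg_s(heat_load_w: float, delta_t_k: float,
                       cp: float = WATER_CP_J_PER_KG_K) -> float:
    """Mass flow needed to carry heat_load_w at a coolant temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_load_w / (cp * delta_t_k)


def within_tolerance(measured: float, target: float, tolerance: float = 0.05) -> bool:
    """True if measured flow deviates from target by at most the given fraction."""
    return abs(measured - target) <= tolerance * target


def max_primary_supply_c(secondary_supply_c: float, approach_c: float) -> float:
    """Warmest primary-side water that still meets the secondary supply setpoint."""
    return secondary_supply_c - approach_c


# Hypothetical example: 100 kW of IT load with a 10 K coolant temperature rise.
target = required_flow_kg_s(100_000.0, 10.0)
print(f"Target flow: {target:.2f} kg/s")
print("Flow in band:", within_tolerance(2.45, target))

# A smaller approach temperature lets the primary side run warmer water
# for the same secondary supply setpoint (40 °C here is an example value).
print(max_primary_supply_c(40.0, 3.0), "vs", max_primary_supply_c(40.0, 8.0))
```

With these example numbers, a 3°C approach allows primary-side water 5°C warmer than an 8°C approach does, which is the mechanism behind the lower energy consumption of the primary-side cooling source.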

4. Deployment core: from complex and slow rollout to plug-and-play

A traditional on-site CDU requires creation of a low-impedance bypass path through dedicated connection tools, cyclic cleaning, and handover. Cleaning alone takes 7 to 15 days. The processes are complex, and the rollout period is long, significantly affecting the production efficiency of intelligent computing power. In contrast, Huawei TMU is pre-cleaned before delivery and can be put into use after on-site cleaning that takes just one to two hours. Through modular design and prefabrication, Huawei TMU shortens the deployment cycle by more than 90%. It addresses the issues of slow rollout and challenging implementation encountered by traditional CDUs.
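As a rough check on the cleaning step alone, the durations quoted above can be compared directly. This sketch covers only cleaning time (7 to 15 days versus one to two hours); the article does not itemize the rest of the deployment cycle, so it is not a derivation of the overall 90% figure. The function name is illustrative.

```python
HOURS_PER_DAY = 24


def time_reduction(before_hours: float, after_hours: float) -> float:
    """Fractional reduction in duration when going from before_hours to after_hours."""
    return 1.0 - after_hours / before_hours


# Least favorable pairing: shortest traditional cleaning vs. longest on-site cleaning.
least = time_reduction(7 * HOURS_PER_DAY, 2)
# Most favorable pairing: longest traditional cleaning vs. shortest on-site cleaning.
most = time_reduction(15 * HOURS_PER_DAY, 1)

print(f"Cleaning time reduction: {least:.1%} to {most:.1%}")
```

Even in the least favorable pairing, the cleaning step shrinks by roughly 98.8%, so the cleaning figures are consistent with a deployment-cycle reduction of more than 90%.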

Future Evolution: From Distributed Control to System-Wide Intelligence

The market for the CDU, a core component of liquid cooling, is growing rapidly, and the CDU has become a key technology behind the high-density and green development of data centers. For example, the AIDC liquid cooling market in China reached CNY18.4 billion in 2024 and is expected to grow to CNY130 billion by 2029, according to the China Academy of Information and Communications Technology (CAICT). Expanding market scale, continuous technological advancement, and policy incentives are the key drivers of future CDU growth.
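The market figures quoted in this article can be sanity-checked with the compound growth formula. The sketch below assumes simple annual compounding; the function names are illustrative, and the implied growth rate for the CAICT figures is derived here rather than stated by CAICT.

```python
def project_value(start: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a fixed annual growth rate."""
    return start * (1.0 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes start to end over the given number of years."""
    return (end / start) ** (1.0 / years) - 1.0


# Arizton (global): US$0.87 billion in 2024, CAGR of 51.93% through 2030.
arizton_2030 = project_value(0.87, 0.5193, 2030 - 2024)
print(f"Global market in 2030: US${arizton_2030:.1f} billion")

# CAICT (China): CNY18.4 billion in 2024 to CNY130 billion in 2029.
caict_cagr = implied_cagr(18.4, 130.0, 2029 - 2024)
print(f"Implied CAGR for China, 2024 to 2029: {caict_cagr:.1%}")
```

The first calculation reproduces the quoted US$10.7 billion, and the second implies growth of roughly 48% per year for the China figures.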

The CDU's evolution from a simple coolant distribution unit to the intelligent control center of a liquid cooling system is more than just a technological upgrade. It represents the leap in AIDC liquid cooling systems from distributed control to system-wide intelligence. Future CDU advancements will focus on three key directions:

• Evolving from a single function to system-wide intelligence;

• Shifting from passive to predictive maintenance;

• Transitioning from standalone operation into group-based collaborative autonomy.

As demand for intelligent computing power soars, the CDU's evolution into an intelligent core will be the key to the liquid cooling industry's competitiveness. Huawei TMU represents not just an upgrade over traditional CDUs, but a reconstruction of the AIDC liquid cooling system. As an intelligent core, it offers autonomous decision-making and intelligent optimization capabilities for a liquid cooling system, resulting in breakthroughs across safety, reliability, O&M convenience, energy efficiency, and deployment efficiency. It sets a new standard for liquid cooling control units, driving the high-density, green, and intelligent development of AIDCs.

The implementation of Huawei TMU demonstrates this trend. By upgrading from a coolant distributor to an intelligent core, the TMU resolves pain points of liquid cooling systems and, more importantly, enables high-quality AIDC development.
