I am not referring to a major utility power failure, just to a common pitfall of power distribution system practice and management. There are several basic, but key, power components in the data center.
A common practice in today’s mission-critical data center environment is to equip servers with dual power supplies (PSes) for improved reliability. However, if improperly implemented, this practice can actually increase the likelihood of a power failure.
In a “perfect” scenario, such as a Tier 4 data center, there are two completely independent power paths, each composed of items 1-6. Each path and the items in the path must be capable of supporting 100 percent of the entire data center load by itself. This represents true 2N redundancy, which means that no single point of failure will interrupt the operation of the data center equipment.
Of course, not everyone is fortunate enough to operate a Tier 4 data center. While we would all like to have complete power system redundancy, cost usually forces some trade-offs. While we try to ensure the highest level of system fault tolerance within budget restrictions, this usually means that although servers have dual PSes, there are not two completely independent paths for items 1-5.
More Common Scenario
The more common scenario is a rack with two rack-level PDUs, where (hopefully) each of the server’s two PS cords is plugged into a different PDU. This creates a “sense” of redundancy for most administrators. In reality, this is where the hidden exposure to power problems starts.
Let’s look at how and why this seemingly simple and common practice is the potential cause of power failures in the data center. In most cases, the dual supplies will share the server load at approximately 50 percent each, when both supplies are active. However, if either PS fails or has lost input power, the remaining PS must draw 100 percent of the power required. Here is where the problem materializes.
Servers are normally installed and operated with both rack-level PDUs available. Typically each PS would only draw 50 percent of the server’s power requirement. Normally, the PDU load is less (again hopefully) than the trip value of the circuit breaker that protects it. In fact, even if the PDU has a current meter, most administrators would think they have the capacity to add more servers if they are “only” at a 60 percent power level.
At 60 percent, the PDUs are overloaded and no one even realizes it!
Here is why: If a server or blade server experiences a PS failure, then 100 percent of its power will be drawn from the remaining PS and therefore from the PDU that supports it. If one PDU or its feed is lost, this happens to every server in the rack at once. This means that at a 60 percent load (under normal conditions), 120 percent of the PDU’s power rating will be put on the remaining PDU and the circuit breaker will trip on the PDU (or branch breaker), shutting down all equipment in that rack. This is a classic cascade failure. The same scenario would hold true if another server or other equipment were added that pushed the load past the tripping point of either PDU.
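The arithmetic behind that 120 percent figure can be sketched in a few lines of Python. This is a simplified model rather than an electrical calculation: it assumes two PDUs with identical ratings and a perfectly even 50/50 split, and it treats anything above 100 percent of the rating as a trip (real breakers trip on a time-current curve). The function names are illustrative only.

```python
def failover_load_percent(normal_load_percent: float) -> float:
    """Load on the surviving PDU, as a percent of its rating, once the
    other PDU is lost. Assumes both PDUs have the same rating and were
    each carrying `normal_load_percent` before the failure."""
    return 2 * normal_load_percent


def breaker_would_trip(normal_load_percent: float,
                       trip_percent: float = 100.0) -> bool:
    """True if the surviving PDU would be pushed past `trip_percent`.
    The 100 percent default is a simplifying assumption."""
    return failover_load_percent(normal_load_percent) > trip_percent


for load in (40, 50, 60):
    print(load, failover_load_percent(load), breaker_would_trip(load))
```

A PDU that looks comfortable at 60 percent therefore fails the check, while one at 40 percent passes it with the remaining PDU landing at 80 percent.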
Playing Russian Roulette
Many racks do not have metered PDUs. This is like playing Russian roulette, since you have no way of knowing if the next server you add to the rack will kill it!
The only way to safely implement dual server PSes with dual rack PDUs is to never exceed 40 percent of the face-rated value of the rack PDU or path. All circuits must always be protected by a circuit breaker, and the UL- and NEMA-mandated codes require that you draw no more than 80 percent of the rated value. Because either PDU may have to carry the entire rack load by itself, each one can normally carry only half of that 80 percent, which is 40 percent.
For example, you cannot draw more than 16A from a 20A PDU. This means that in a dual-PDU rack, the entire equipment load should not exceed 16A for the rack. Therefore, each PDU should normally only have an 8A load on it, to avoid a potential cascade overload and resultant rack-level power failure.
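The same worked example can be written as a small Python helper, which makes it easy to repeat for other breaker sizes. This is a minimal sketch: the function and parameter names are hypothetical, and the 80 percent derating and two-path assumption are taken from the text above.

```python
def usable_amps(rated_amps: float, derating: float = 0.80) -> float:
    """Current that may safely be drawn from a PDU of the given rating."""
    return rated_amps * derating


def max_normal_load_per_pdu(rated_amps: float,
                            derating: float = 0.80,
                            paths: int = 2) -> float:
    """Highest normal load per PDU that still lets one surviving PDU
    carry the whole rack within the derated limit. Assumes the load is
    shared evenly across `paths` PDUs of equal rating."""
    return usable_amps(rated_amps, derating) / paths


for rating in (20, 30):
    print(f"{rating}A PDU: usable {usable_amps(rating):.1f}A, "
          f"normal load per PDU {max_normal_load_per_pdu(rating):.1f}A")
```

For a 20A PDU this reproduces the 16A rack limit and the 8A per-PDU target.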
As mentioned earlier, even those administrators who do have metered PDUs do not realize that once they go past the 40 percent power level, they are in danger of having a cascade power failure. Moreover, as servers are upgraded and added all the time, it is easy to see how the exposure continues to increase with no warning until a problem occurs. Then everyone involved is baffled as to why power was lost, because everyone thought they had “redundant” power.
Consider a Metered PDU
If you are fortunate enough not to have had this happen to you already, I would suggest that you review your rack-level current draw at each PDU. If you do not have metered PDUs, you should consider upgrading to metered PDUs in the near future. If you have many racks, you should consider metered PDUs with remote monitoring (via SNMP and/or Web) that can send SNMP traps to your management software, since this lowers the burden of manually monitoring dozens or hundreds of PDUs.
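Once current readings are available, the review itself is easy to automate. The Python sketch below flags any rack whose combined draw across its PDUs exceeds what a single PDU could safely carry alone. How the readings are collected (SNMP polling, a Web interface, or a clipboard walk-through) is outside the sketch; the `PduReading` record, the rack names, and the sample values are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class PduReading(NamedTuple):
    rack: str             # rack identifier (hypothetical naming)
    pdu: str              # PDU identifier within the rack
    rated_amps: float     # face rating of the PDU
    measured_amps: float  # current draw read from the PDU meter


def racks_at_risk(readings, derating=0.80):
    """Return {rack: (total_amps, limit_amps)} for racks whose total load
    could not be carried by one PDU at `derating` of its rating."""
    by_rack = defaultdict(list)
    for reading in readings:
        by_rack[reading.rack].append(reading)

    at_risk = {}
    for rack, pdus in by_rack.items():
        total = sum(p.measured_amps for p in pdus)
        # Use the smallest PDU in the rack as the worst-case survivor.
        limit = min(p.rated_amps for p in pdus) * derating
        if total > limit:
            at_risk[rack] = (total, limit)
    return at_risk


sample = [
    PduReading("rack-01", "A", 20, 6.0),
    PduReading("rack-01", "B", 20, 7.0),
    PduReading("rack-02", "A", 20, 12.0),  # 60 percent on each PDU
    PduReading("rack-02", "B", 20, 12.0),
]
for rack, (total, limit) in racks_at_risk(sample).items():
    print(f"{rack}: {total:.1f}A total exceeds {limit:.1f}A single-PDU limit")
```

Checking the rack total against one PDU's derated limit, rather than each PDU against 40 percent, also handles racks where the load is not split evenly between the two PDUs.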
Bottom line: If you are implementing redundancy, make sure that each path can sustain 100 percent of the load if the other path fails. Review and document your existing load structure, and continue to proactively monitor and manage the load levels on all PDUs, as well as all the other elements of your power path. Changing out PDUs can involve some downtime, as can any power path work when there is no true 2N power path. Take your choice: some planned, limited downtime or an unplanned surprise shutdown.