Technology has evolved to the point that we now have collision avoidance systems for automobiles. So why don’t we have mistake avoidance systems for data center managers? Many of the issues that compromise data center performance stem from mistakes that could have been avoided. It’s often the little things that take you down, and a “small” oversight can cause an outage that brings a company to its knees.
There are hard lessons to be learned in the data center and most of us will endure a few of them during our careers. What’s ultimately important is how we improve from these experiences. It is imperative that every data center manager follow what I call the golden rule of data center management: Whenever possible, learn from the mistakes made by data center managers everywhere. There are five steps you can take to do this.
Step No. 1: First, have the self-confidence to admit your mistakes.
Owning up to errors isn’t an acknowledgement of failure. On the contrary, it’s an indicator that you want to be proactive, learn more and operate the best possible facility.
Step No. 2: For an ongoing learning exercise, put your staff in “safe” situations where mistakes can be made without consequences.
Since it’s difficult to conduct “what-if” scenarios in a live facility, I recommend using scripted role-playing and reenactments, followed by group discussion. This is a great way to develop team chemistry and foster conversations that can potentially resolve issues before they occur.
Step No. 3: Many data centers have found it helpful to create a “lessons learned” binder for their organization.
This binder contains descriptions of, and solutions to, failures in procedures, processes and equipment that have occurred at their facility or elsewhere. It serves as a reference guide for the team and provides guidance for new employees. I recommend reviewing new inserts to the binder during quarterly meetings.
Step No. 4: If you are building a new data center, I strongly recommend including a commissioning process where a third-party consultant, separate from the design and contractor teams, is involved at all stages.
This provides a set of fresh eyes that can identify potential problems and an objective source of insight into best practices. During construction, the commissioning agent can offer expertise on decisions regarding electrical conduit, proper equipment sizing and more. They can help you value-engineer the project as it moves along. When the data center is completed, the commissioning agent is a critical evaluator during the testing process. As part of the commissioning process, I suggest asking your commissioning agent to help you develop a learning exercise that gives your staff the chance to kick the tires of all systems without penalty.
Step No. 5: Finally, be courageous about making changes.
It might involve expense and pain, but asking for more money or time is far better than the potential consequences.
16 Most Common Data Center Management Mistakes
To help spur some preemptive thinking, I’ve compiled the following list of the 16 most common mistakes made in data centers all over the world. This list is in no particular order and it’s by no means exhaustive. However, it demonstrates how often we forget some of the basics.
Mistake No. 1: Not measuring power usage and forgetting to consider the comparative costs of electricity for the data center.
Remember, the data center is a huge bull’s-eye when it comes to operational expense. Conservation and efficiency are always on the CFO’s radar so they’d better be on yours as well.
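As a rough illustration of what measuring power usage and comparing electricity costs can look like, the sketch below estimates annual electricity cost from an average metered load and compares two utility rates. The function name, the 500 kW load and both rates are hypothetical values chosen for the example, not figures from any real facility.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_cost(avg_load_kw: float, rate_per_kwh: float) -> float:
    """Estimate yearly electricity cost for a constant average load."""
    return avg_load_kw * HOURS_PER_YEAR * rate_per_kwh


# Hypothetical inputs: a 500 kW average facility load at two utility rates.
avg_load_kw = 500.0
rates = {"Site A": 0.07, "Site B": 0.12}  # dollars per kWh (assumed)

for site, rate in rates.items():
    cost = annual_energy_cost(avg_load_kw, rate)
    print(f"{site}: ${cost:,.0f} per year")
```

Under these assumed numbers, a five-cent difference per kilowatt-hour works out to roughly $219,000 a year, which is the kind of gap that only becomes visible once the load is actually metered.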
Mistake No. 2: Not designing for modular growth.
You don’t want to have a beautiful, state-of-the-art data center that’s underutilized. It’s more efficient to build out modular chunks within the “shell” of the building as your organization grows.
Mistake No. 3: Not taking advantage of data center design and facility best practices to help your data center reduce costs and run more efficiently.
Believe it or not, I’ve seen data centers that weren’t applying well-documented practices such as hot and cold aisles.
Mistake No. 4: Believing that there’s only one way to design or maintain a data center.
The standard model isn’t always the best, most cost-effective solution. Data centers are generally similar, but how they are put together should reflect the philosophy of the company and what works best for that specific location. Create your own standards that provide the results you want to achieve.
Mistake No. 5: Not hiring the correct personnel for the job.
Once you’ve ensured that the right people are on board, make sure their roles and responsibilities are clearly defined and understood by everyone.
Mistake No. 6: Not providing the proper training and guidance to your staff to react to issues and concerns that can be avoided.
The reenactments and role-playing scenarios I described earlier are extremely useful strategies in helping employees become better equipped at dealing with unexpected challenges.
Mistake No. 7: Not developing a Critical Environment Work Authorization (CEWA) process.
Any work to be performed in the data center should include a step-by-step description of what the job entails, the impact on company operations, safety measures to be taken and other critical details. This forces you to consider every move that will be made and ensures that a consistent CEWA process will be followed to prevent injuries and outages. Make sure all field personnel and contractors receive a copy of the process and thoroughly understand its contents. Each CEWA process should have a risk level associated with it. At our company, we designate work as having a risk level of 1-4. Work that involves higher levels of risk requires executive authorization.
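For teams that track these authorizations electronically, a CEWA record might be sketched along the following lines. This is only an illustration: the class and field names are hypothetical, the example job is invented, and two details are assumptions not stated above, namely that level 4 is the highest risk and that levels 3 and 4 are the ones requiring executive authorization.

```python
from dataclasses import dataclass
from typing import List

# Assumption: levels 3 and 4 count as "higher" risk. The text only says
# that higher-risk work requires executive authorization.
EXECUTIVE_APPROVAL_THRESHOLD = 3


@dataclass
class WorkAuthorization:
    """One CEWA record (hypothetical structure for illustration)."""
    description: str
    steps: List[str]
    operational_impact: str
    safety_measures: List[str]
    risk_level: int  # assumed scale: 1 (lowest) to 4 (highest)

    def __post_init__(self) -> None:
        # Reject records that skip the essentials the process is meant to force.
        if not 1 <= self.risk_level <= 4:
            raise ValueError("risk_level must be between 1 and 4")
        if not self.steps:
            raise ValueError("a step-by-step description is required")

    def requires_executive_approval(self) -> bool:
        return self.risk_level >= EXECUTIVE_APPROVAL_THRESHOLD


# Hypothetical example job.
job = WorkAuthorization(
    description="Replace a failed fan in a cooling unit",
    steps=["Notify customers", "Isolate the unit", "Swap the fan", "Test and restore"],
    operational_impact="Reduced cooling redundancy during the work window",
    safety_measures=["Lockout/tagout", "Second technician present"],
    risk_level=3,
)
print(job.requires_executive_approval())
```

Validating the record at creation time mirrors the intent of the process: work cannot be scheduled until the steps and the risk level have been written down.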
Mistake No. 8: Not alerting your external or internal customers that work is being performed in the data center that could cause an outage.
Communication is critical so that all parties, especially customers, are informed well in advance and can prepare adequately to keep their business operations from being affected.
Mistake No. 9: Not taking advantage of free cooling in climates that will permit it.
The more you can reduce dependence on mechanical cooling, the better. By the way, from a geographical perspective, this solution is more viable than you might think. Free cooling works when outside temperatures are about 65 to 66 degrees Fahrenheit or lower. For example, with our data centers in Atlanta, we can shut off our units at night and go into free cooling mode.
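To get a feel for how often a site could run in free cooling mode, one simple approach is to count the hours in which the outdoor temperature sits at or below the threshold. The sketch below does that for a single day. The 65-degree cutoff comes from the figure above and is assumed to be Fahrenheit; the function name and the sample readings are hypothetical, and a real assessment would also account for humidity and the specifics of the cooling plant.

```python
FREE_COOLING_MAX_F = 65.0  # threshold from the text, assumed to be degrees Fahrenheit


def free_cooling_fraction(hourly_temps_f):
    """Return the fraction of hourly readings at or below the threshold."""
    if not hourly_temps_f:
        raise ValueError("at least one temperature reading is required")
    eligible = sum(1 for t in hourly_temps_f if t <= FREE_COOLING_MAX_F)
    return eligible / len(hourly_temps_f)


# Hypothetical hourly outdoor readings for one day, midnight through 11 p.m.
sample_day = [58, 57, 56, 55, 55, 56, 59, 63, 67, 71, 74, 77,
              79, 80, 80, 78, 75, 72, 69, 66, 64, 62, 60, 59]

share = free_cooling_fraction(sample_day)
print(f"Free cooling possible for {share:.0%} of the day")
```

Run against a full year of local weather data instead of one invented day, the same calculation gives a first-pass estimate of how much mechanical cooling could be avoided.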
Mistake No. 10: Lack of communication and coordination between IT and facilities in pursuing corporate goals.
An example of this is the development of a data center energy efficiency program that only involves facility recommendations. There are plenty of green IT strategies and opportunities that should be incorporated as well.
Mistake No. 11: Placing an emergency power off (EPO) near an exit door without proper signage and secure casing.
If you ask any data center manager about their EPO, most of them will recall a story about an employee or cleaning crew member who accidentally shut off the power to the facility. Make sure the EPO is properly encased with an alarm.
Mistake No. 12: Designing and building a data center to an “uptime tier” level and not maintaining it to that level after it is built.
You might design your center as a Tier 4 facility today, but if you don’t maintain and upgrade your equipment and systems over time, you likely won’t be able to sustain Tier 4 performance. Issues such as equipment obsolescence will create outages.
Mistake No. 13: Confusing network latency with application latency.
If moving your servers adds 50 milliseconds of network latency, that doesn’t necessarily mean that you will add the same amount of latency to application response times.
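A simplified model helps show why the two numbers differ. Assume, purely for illustration, that an application's response time is its server processing time plus one network round trip for every exchange between client and server, and that the extra 50 milliseconds applies to each round trip. The function name, the 200 ms processing time, the 10 ms baseline and the round-trip counts below are all hypothetical.

```python
def response_time_ms(server_ms: float, round_trips: int, rtt_ms: float) -> float:
    """Simplified model: processing time plus one network delay per round trip."""
    return server_ms + round_trips * rtt_ms


ADDED_LATENCY_MS = 50.0   # the extra network latency from the example in the text
BASELINE_RTT_MS = 10.0    # assumed round-trip time before the move

# Hypothetical application profiles: (server processing ms, round trips per request)
profiles = {
    "single request": (200.0, 1),
    "chatty client": (200.0, 12),
}

for name, (server_ms, trips) in profiles.items():
    before = response_time_ms(server_ms, trips, BASELINE_RTT_MS)
    after = response_time_ms(server_ms, trips, BASELINE_RTT_MS + ADDED_LATENCY_MS)
    print(f"{name}: {before:.0f} ms -> {after:.0f} ms (+{after - before:.0f} ms)")
```

In this model the single-request application slows by 50 ms, while the chatty one, which makes 12 round trips, slows by 600 ms. The network change is identical in both cases; the application impact depends on how the application uses the network.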
Mistake No. 14: Presuming that all of your facility components are being addressed.
I’m referring to the issues that often go unnoticed such as electrical grounding, static electricity or personnel safety, to name a few. Never take anything for granted. Just because you have an uninterruptible power supply (UPS), for example, doesn’t mean you can overlook the performance of lightning protection and branch circuit monitoring.
Mistake No. 15: Creating a false sense of security through overreliance on environmental monitoring.
Don’t just rely on equipment monitors. You need to walk around the facility every day and use sight, sound and smell to determine if things are amiss. I once heard a strange sound coming from a UPS that I’d never heard before. It turned out there was something fatally wrong with the device. Good data center management is a job for all of the senses!
Mistake No. 16: Assuming that all network connectivity is equal.
Finally, be sure that your service provider partners can ensure the availability and reliability of the data center network (which allows your customers access to their critical data and applications), as well as provide performance-based service-level agreements (SLAs).
Randy J. Ortiz is Director of Data Center Design and Engineering at Internap. A 20-year industry veteran, Randy has overseen the design and construction of more than one million square feet of data center facilities worldwide.