How We Manage Amtrak's Challenging Infrastructure
Amtrak has always been a combination of the new and the old. Formed by combining nine remaining passenger railroads, Amtrak brings with it decades of railroad lore. Although formed by the past, it was fashioned for the future.
Like the company, Amtrak's IT infrastructure embraces the new while retaining much of the old. As an overview, Amtrak's IT supports the following: TPF (Transaction Processing Facility), a mainframe operating system developed explicitly for the fast response times required by the reservation industry; z/OS (the operating system for IBM's z900 series of large mainframes) for business applications such as inventory management and time recording; Solaris servers for the e-commerce applications, the Work Management System and HR/Finance applications; and the ever-present Windows servers for applications ranging from the company's intranet to e-mail.
As Amtrak's IT has grown, the need to consolidate has introduced VMware servers for virtual systems, Citrix for virtual desktops, a SAN (storage area network), NAS (network-attached storage) for shared directories and file systems, and enterprise-wide database servers. To add to the complexity, Amtrak's IT equipment resides at three main data centers and hundreds of remote sites across the country.
The Secret is Standardization
The "secret" to managing such diverse entities is, as you may have guessed, standardization. This means standardization of server builds, software components, hardware, backups and monitoring software, as well as policies and procedures.
Another way of thinking of this is the "boring is good" philosophy. If every time you look at a SQL Server you see that it has the same files on the same drives in the same directories, then you know you have reached the epitome of boredom. And, in the computer management arena, boring is good.
Tools and capabilities may change from server to mainframe, but the procedure is always the same. Here are some examples:
Customer Interface for Support
Amtrak has one (and only one) Help Desk for all user concerns: desktop, applications and hardware. The same software tool captures the data, as well as the "who, what, when and where." (After we solve the problem, we try to determine the "why.")
If the Help Desk cannot resolve the issue, then the same process is followed for all issues. Critical issues are paged out to support personnel and certain customers. Each type of support required is sent to the appropriate "queue" of individuals, based on the application and location of hardware. As the issue is worked, the customer is contacted for problem details and to confirm resolution.
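The routing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Amtrak's actual tool: the queue names, the application and location labels, and the `route_issue` function are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical routing table: (application, hardware location) -> support queue.
# None of these names come from the article; they only illustrate the idea.
QUEUES: Dict[Tuple[str, str], str] = {
    ("reservations", "data-center"): "mainframe-support",
    ("e-mail", "data-center"): "windows-support",
    ("e-mail", "remote-site"): "field-support",
}

# Assumed catch-all queue for combinations the table does not list.
FALLBACK_QUEUE = "general-support"


def route_issue(application: str, location: str, critical: bool) -> Dict[str, object]:
    """Pick the support queue for an issue the Help Desk could not resolve.

    Every issue follows the same path; only critical issues also trigger a page.
    """
    queue = QUEUES.get((application, location), FALLBACK_QUEUE)
    return {"queue": queue, "page_support": critical}


if __name__ == "__main__":
    print(route_issue("e-mail", "remote-site", critical=False))
    print(route_issue("reservations", "data-center", critical=True))
```

The point of the sketch is that there is a single lookup and a single code path, so a critical mainframe issue and a routine e-mail issue differ only in the data, not in the procedure.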
Health of Systems
We use the same monitoring tool on every midrange server in the enterprise. The tool is configured to monitor various items, including system uptime, running services, available storage and CPU utilization. Should any monitored item cross its threshold, an alarm is sent to the main console, which is monitored 24x7 in one-hour shifts. An alarm is also sent via e-mail to the system administrator for that system.
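As a rough sketch of that threshold-and-alarm logic, consider the following Python. It is not the monitoring product we use. The threshold values, metric names and the "main-console" label are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical thresholds; the article does not state the real values.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 90.0,        # alarm when CPU utilization exceeds this
    "disk_used_percent": 85.0,  # alarm when storage use exceeds this
}


@dataclass
class Alarm:
    server: str
    metric: str
    value: float
    limit: float


def check_server(server: str, metrics: Dict[str, float]) -> List[Alarm]:
    """Return one Alarm for every reported metric that exceeds its threshold."""
    alarms = []
    for metric, limit in THRESHOLDS.items():
        value = metrics.get(metric)
        if value is not None and value > limit:
            alarms.append(Alarm(server, metric, value, limit))
    return alarms


def recipients(alarm: Alarm, admins: Dict[str, str]) -> List[str]:
    """Every alarm goes to the main console and, if one is known,
    to the administrator responsible for that server."""
    targets = ["main-console"]
    admin = admins.get(alarm.server)
    if admin:
        targets.append(admin)
    return targets


if __name__ == "__main__":
    admins = {"sql01": "sql01-admin@example.org"}  # hypothetical address
    for alarm in check_server("sql01", {"cpu_percent": 95.0, "disk_used_percent": 40.0}):
        print(alarm, "->", recipients(alarm, admins))
```

Because the same check runs against every server, the console shows one consistent picture regardless of whether the underlying system is Windows or Solaris.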
The goal here is to be proactive by addressing situations before they become problems. Using one tool across the enterprise allows for one picture of the state of the enterprise for the server environment. While the mainframes use different tools, the concept is the same: Operators are alerted before a problem occurs.
Backup Strategy
Amtrak again chose the same "boring" solution across the board for all midrange servers, whether they are Windows, VMware guest systems, Solaris, SQL DB (database) servers or an Oracle Enterprise cluster. This way, we have one report on which to check for errors, one product on which to train people and one process for storing and restoring data.
Once again, as you may have guessed, there is one change control process for all changes. This is the case whether they are for application upgrades, installing new software, adding hardware or routing the network. A weekly change meeting is held with representatives from all the IT walks of life. Each record is reviewed for its impact not only on the requested program, but also on others that may be affected. Also, while each record has its own list of required approvers and its own risk category, the process ensures that all items receive the required attention.
Another important note: we have kept our change process very simple. There are only three risk categories:
1. Emergency (used to fix a critical problem)
2. Standard (changes planned at least two weeks in advance)
3. Administrative (tasks that don't change the environment itself, such as adding users or updating train station data, and that need only a three-day lead time)
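The three categories and their lead times can be expressed as a small lookup, sketched below in Python. The two-week and three-day values come from the list above. The zero lead time for Emergency changes is an assumption, and the function names are hypothetical.

```python
from datetime import date, timedelta
from enum import Enum


class Risk(Enum):
    EMERGENCY = "emergency"
    STANDARD = "standard"
    ADMINISTRATIVE = "administrative"


# Standard and Administrative lead times are taken from the list above.
# Zero days for Emergency is an assumption: the article gives no figure.
LEAD_TIME = {
    Risk.EMERGENCY: timedelta(days=0),
    Risk.STANDARD: timedelta(weeks=2),
    Risk.ADMINISTRATIVE: timedelta(days=3),
}


def earliest_date(risk: Risk, submitted: date) -> date:
    """Earliest date a change of this category may be scheduled."""
    return submitted + LEAD_TIME[risk]


def has_enough_lead_time(risk: Risk, submitted: date, scheduled: date) -> bool:
    """True if the scheduled date respects the category's lead time."""
    return scheduled >= earliest_date(risk, submitted)


if __name__ == "__main__":
    submitted = date(2024, 1, 1)
    for risk in Risk:
        print(risk.value, earliest_date(risk, submitted))
```

With only three categories, the whole policy fits in one table, which is much of why the process stays simple to explain and to follow.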
Once again, our philosophy is that if everyone has access to the same information, whether they are a customer, a manager or a support person, then a problem will never have the opportunity to be "exciting." In other words, everyone's response will be more or less relegated to, "Yes, I know about the problem already; tell me something new."
To illustrate, we have status reports that are sent out six times a day, with updates on problems as well as on the status of our critical systems. Additionally, our internal Web site carries the same status reports 24x7. While this doesn't guarantee that everything is received as routine, it does go a long way toward satisfying our customers. They know when we have a problem and that we are working on it.
We also use a VRU (Voice Response Unit) to share information. If a critical system is not working or a site is down, we provide a message that the user will hear when they call the Help Desk to report a problem. This is but another venue in which to share the fact that we know that there is a problem and that we are working on it. After all, there is almost nothing more frustrating than having a problem and not being able to tell someone about it. Also frustrating is the inability to know if "anybody up there" in charge is doing anything about it.
As I have alluded to previously, we have standardized where possible. We use only Windows and Solaris servers. They are built according to standards (as much as possible, depending upon the application). Also, we use SQL Server exclusively for Windows databases and Oracle for databases running under Unix. This means that when a technician researches a problem, the same techniques can be used almost across the board.
Continuity of Operations
We have pretty much taken the "excitement" out of maintaining and operating Amtrak servers. We use the same customer procedures for support. We use the same tools for backup and monitoring. We use the same change-control procedures. We prevent excited rumors from developing by keeping customers informed. Plus, we have given ourselves a limited variety of hardware and software to troubleshoot.
Keeping Murphy's Law in Mind
However, Murphy being who he is, there is always the chance that a system will fail, despite RAID, redundant power and redundant network connections. For those systems we consider critical, we have clustered solutions and different levels of automated failover, depending upon business requirements. That way, even though we still have to address the issue, we are doing it in a way that is transparent to the customer.
Yes, Amtrak still has legacy applications that are twenty years old. But we also have some of the newest technology, such as VMware, SANs and NAS. And, through the methods explained above, we all manage to happily coexist.
Karen Shockley has an extensive IT background that includes all phases of the systems engineering lifecycle. Currently Director of Amtrak's Enterprise Data Centers, she is responsible for the 24x7 operation of over 1,000 midrange servers and mainframes.
Shockley began her career in the U.S. Air Force, learning structured programming methodologies and concentrating on Quality Assurance. After nine years in the Air Force, she moved to SAIC, where she led an integration effort for a Corporate Executive Information System.
Shockley led Tiger Team efforts at Amtrak, was the mainframe liaison to IBM, managed the operational effort for an SAP implementation and has earned Project Management Professional, INCOSE and ITIL Foundation certifications.
Shockley has a degree in Physics from Miami University of Ohio and a Master of Science in Computers from the University of Oklahoma. She has published articles on Data Warehousing, Meta Data and Customer Relationship Management. She can be reached at firstname.lastname@example.org.