Our school system provides just enough money to keep the system running and updated; any new systems or services are paid for on an as-needed basis. There are no test systems, evaluations or network simulations. Whatever we buy goes straight into production, and any issues that arise surface on live systems.
We started looking at SANs (storage area networks) about seven years ago, beginning with Winchester systems. We wanted fully redundant meshed switches and SAN servers for maximum reliability, but with the technology available at that time, there was always a single point of failure that we were unable to resolve. The approach was also too costly, so we stuck with in-server disks; they were faster than running across a SAN and a lot cheaper.
Management suggested trying the EMC Clariion 100 SAN package when it became available. I think it was something like 1TB, and the price seemed like a steal back then: $15,000 for a Fibre Channel switch, some QLogic cards and the EMC Clariion SAN server. After purchasing the package I noticed that it did not include the fiber optic cables we needed, so we paid EMC $120 for each. Other parts were not included either, such as rack-mounting hardware that could have cost up to another $200.
We used the SAN for housing Quorum disks for clusters and that seemed to work fine. Eventually, I was asked to move our GroupWise e-mail server to the new SAN and cluster it. I was concerned about the performance, but I moved it anyway. The performance was worse than with local disks, but not so much worse that it outweighed the redundancy of the clustered server.
After two to three years, GroupWise performance degraded, and cluster nodes lost communication with each other through the Quorum disk and/or network. The Fibre Channel infrastructure checked out well. I checked the Clariion SAN server and everything looked good.
I assumed the e-mail database was just too big for GroupWise to handle. I first forced all e-mail more than one year old to be archived into the users’ home directories, but this had little to no effect. Before I started walking down the uncharted road of migrating users to another post office, I checked the storage stats on the server and noticed that peak disk writes to the SAN were extremely slow, in the range of 1.5MB/s to 7MB/s. This gave me my first clue as to what was happening.
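For readers who want to run a similar check on their own storage, here is a minimal Python sketch. It is not the tool used in this story (the figures above came from the server's own storage statistics). It times a synced sequential write to a directory that is assumed to sit on the SAN volume; the function name, file size and block size are illustrative assumptions.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_mb=16, block_kb=1024):
    """Write total_mb of random data to a temp file in `directory`; return MB/s."""
    block = os.urandom(block_kb * 1024)  # random data, so it cannot be compressed away
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device, not just the OS cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the test file
    return (blocks * block_kb / 1024) / elapsed


if __name__ == "__main__":
    # Assumption: replace the temp directory with a path on the volume under test.
    rate = measure_write_throughput(tempfile.gettempdir())
    print(f"sequential write: {rate:.1f} MB/s")
```

A single-digit MB/s result on an otherwise idle volume would be worth investigating before blaming the application.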
I tested the back-end SAN. Performance turned out to be terrible, but we were stuck. We had no servers to move the e-mail server to and no SAN to back up this SAN. I was already looking for a faster backup solution at the same time. I came across Pillar Data Systems for disk-to-disk backup and sent them a request.
I knew we were living on borrowed time, but I was also very wary of getting another terrible SAN product. What I found was that Pillar built in all the redundancy that we originally wanted and added speed and ease of use, for about the same price per gigabyte as the Clariion package.
The Pillar solution has three components: the Pilot, the Slammer and the Bricks. The Bricks are the actual storage units; they house two RAID 5 disk arrays. Each array has a hot swap spare built in, as well as dual redundant controllers and dual redundant power supplies. This means you could lose up to four disks, one controller and a power supply in a Brick and still be running. The Slammer controls access to the Bricks. It has redundant controllers, as well as redundant power supplies. The Pilot is the management portion of the system, and it is fully redundant in power and electronics, just like the other two.
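The four-disk figure follows from simple arithmetic: a RAID 5 array tolerates one failed disk at a time, and each hot spare allows one rebuild that restores that tolerance. The Python sketch below works through this under the assumption that failures within the same array are separated by a completed rebuild; the function names are hypothetical and the model ignores controllers and power supplies.

```python
def array_survives(events, hot_spares=1):
    """Replay 'fail' / 'rebuilt' events for one RAID 5 array.

    Returns False if a second disk fails while the array is still degraded.
    """
    degraded = False
    spares = hot_spares
    for event in events:
        if event == "fail":
            if degraded:
                return False  # two failed disks at once: RAID 5 loses data
            degraded = True
        elif event == "rebuilt":
            if degraded and spares > 0:
                spares -= 1  # the spare takes the place of the failed disk
                degraded = False
    return True


def max_sequential_failures(hot_spares=1):
    """Count the failures one array survives when each one is followed by a rebuild."""
    events, failures = [], 0
    while True:
        events.append("fail")
        if not array_survives(events, hot_spares):
            return failures
        failures += 1
        events.append("rebuilt")


# One Brick as described in the article: two arrays, one hot spare each.
arrays_per_brick = 2
print(arrays_per_brick * max_sequential_failures(hot_spares=1))  # 4
print(array_survives(["fail", "fail"]))  # False: no rebuild in between
```

The second result is the caveat behind the headline number: two disks failing in the same array before the spare has finished rebuilding would still take that array down.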
In the end, we needed better performance as well as reliability. Pillar Data gave us the reliability, but I was still very unsure of the performance. As part of this SAN upgrade, my boss pushed me to move to blade servers and virtualized servers. We bought into the IBM Blade server with five blades, at about half the cost of the Hewlett-Packard Blade Center. I was able to test VMware and XenSource products. Being a Novell man, I was a bit disheartened to find that Xen just wasn’t where VMware was, so even though the extra cost hurt, we went with VMware as our virtual server platform on the blade servers.
I installed VMware ESX server on all the blades, tied them together with Virtual Center and proceeded to set up the SAN with the Pillar Axiom system. I had installed about two servers per blade when our e-mail server entered its death throes, and I made the decision to move the e-mail server to a virtual machine.
I was very concerned about the SAN performance but also about the server performance. Each blade had dual quad-core processors and 8GB of RAM, but VMware only allows you to assign one processor to a NetWare machine. I did a backup of the e-mail server, which took about eight hours, after which the disk shut down and the server took a poison pill (that was close).
It took one hour to restore the files. So at 3 a.m. our e-mail server was moved from a server with two dual-core Xeon processors, 4GB of RAM and a dedicated QLogic 4Gb/s FC card to a virtual server with a single processor, 2GB of RAM and a shared QLogic 4Gb/s card, and it ran like a bat out of hell! To give you an idea, switching to a user’s in-box could sometimes take up to one minute; now it takes less than one second. Processor utilization on the server maxes out at about 800MHz, during very rare spikes, and hovers around 300MHz when we are fully loaded with 750 active users.
What this boils down to is a very responsive SAN. With the old servers the processor utilization was a lot higher and file queues were getting backed up. I suspect GroupWise was either having a hard time writing to disk, or it was trying to verify data after it was written and that overworked the server. In a sense GroupWise is a nice test for disk usage; it uses a ton of small files and it accesses them like crazy.
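That small-file access pattern can be approximated without GroupWise. The Python sketch below is an illustrative stand-in, not a GroupWise benchmark: it writes and then rereads many small files in a scratch directory, assumed to sit on the volume under test, and reports operations per second. The function name, file count and file size are assumptions.

```python
import os
import tempfile
import time


def small_file_benchmark(directory, count=500, size_bytes=4096):
    """Write then read `count` small files under `directory`; return ops per second."""
    payload = os.urandom(size_bytes)
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        paths = [os.path.join(workdir, f"msg{i:06d}.dat") for i in range(count)]

        start = time.perf_counter()
        for path in paths:
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # each small write must actually reach the device
        write_seconds = time.perf_counter() - start

        start = time.perf_counter()
        for path in paths:
            with open(path, "rb") as f:
                f.read()  # note: these reads may be served from the OS cache
        read_seconds = time.perf_counter() - start

    return {
        "writes_per_s": count / write_seconds,
        "reads_per_s": count / read_seconds,
    }


if __name__ == "__main__":
    result = small_file_benchmark(tempfile.gettempdir())
    print(f"writes: {result['writes_per_s']:.0f}/s, reads: {result['reads_per_s']:.0f}/s")
```

The write figure is the more telling one, since every file is synced to storage; the read figure mostly reflects the operating system's cache unless the files are reread after the cache has been cleared.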
We still have one functioning Clariion SAN server, the Clariion 150 series. That model seems to work better and gives us a good basis for comparing the functionality of the two systems. With the move to the new system, we gained important capabilities that the older system lacked:
Quality of Service: Even though there are only four settings on the Pillar system (archive, low, medium and high), this makes a world of difference when allocating storage between low-usage servers and high-usage servers.
Logical Unit Number Mapping: This is more powerful than you might imagine, especially when moving virtual machines between physical servers.
Management via multiple methods: Very nice. Had this been available in the Clariion system, we might have found the problem with the SAN well before we came so close to losing all our data.
Data Protection: On top of the RAID 5 and Hot Spare there are further options for double and triple redundancy.
In the end we may have gotten lucky, or maybe we were helped by lessons learned.
Going with the absolute lowest price may not be the best way or even the cheapest. I would suggest that anyone looking at SAN products first figure out what they want, then find out what it will take to get it. It may well be that it can be managed within your budget.
About the author: Brett Littrell is network manager for the Milpitas (Calif.) Unified School District, which manages about 10,000 students and 1,000 employees. Its computer network has around 2,000 to 2,500 client computers with three technicians to maintain them all.