With the storage capacities of enterprise-class RAID units expanding and backup windows shrinking, the common procedure of running full tape backups is becoming increasingly difficult to accomplish.
eWeek Labs and engineers from Veritas Software Corp.'s labs, in Mountain View, Calif., recently completed a high-end backup performance test to see if, and determine how, enterprise-class backup systems can be pushed to meet increasingly demanding corporate needs. eWeek Labs worked with Veritas' George Winter, product marketing engineer for backup, and his staff.
Although hard drive and RAID systems are getting faster and larger, tape technologies have not, for the most part, kept pace. As a result, organizations must design highly parallel, increasingly complex backup infrastructures that allow multiple tape drives to receive backup traffic simultaneously.
eWeek Labs and Veritas designed a system that was able to back up a 2-terabyte data set in less than 1 hour. This would be overkill for most organizations' storage needs (not to mention budgets), but the issues we encountered in attempting to balance server, disk, network and tape hardware to maximize speed apply to any situation where data needs to be copied to tape in the shortest possible time period.
Organizations that have an even smaller backup window than 1 hour will need to investigate hardware-based drive-to-drive backup options.
Know Your Data
An effective backup strategy must be tailored to the type of data being backed up. Highly compressible data, such as general user documents or Web site data, can be backed up much more quickly than less compressible data such as database files. To mimic a real-world situation, we backed up a hot Oracle Corp. database, which had a compression ratio of roughly 1.8-to-1 to 1.9-to-1.
Backing up highly compressible data will require a greater investment in disk and SAN (storage area network) bandwidth (because the data is sent to the tape drives in uncompressed form) and a smaller investment in tape drive hardware. Backing up less compressible data will require the opposite: lower transport bandwidth and tape drives that have faster write speeds and faster tape changing speeds.
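To put rough numbers on this trade-off, the sketch below estimates how much uncompressed data a server must deliver to keep a set of tape drives streaming. It assumes the drives compress in hardware and write to tape at their native rate, so the required feed rate scales with the compression ratio. The function name is hypothetical; the 24 drives and 20MB-per-second native rate are the figures from this test, and the 3-to-1 ratio is an illustrative value for highly compressible data.

```python
def host_feed_rate_mb_s(drives, native_mb_s, compression_ratio):
    """Uncompressed MB/s the server must supply to keep every drive streaming.

    Assumes each drive compresses in hardware and writes at its native rate.
    """
    return drives * native_mb_s * compression_ratio


if __name__ == "__main__":
    # 1.0 = incompressible data, 1.8 = the test database, 3.0 = illustrative
    for ratio in (1.0, 1.8, 3.0):
        rate = host_feed_rate_mb_s(24, 20.0, ratio)
        print(f"compression {ratio}-to-1: feed the drives about {rate:.0f} MB/s")
```

Under these assumptions, the same 24 drives need about 480MB per second of incompressible data but roughly three times that at a 3-to-1 ratio, which is why the disk and SAN side has to grow as data becomes more compressible.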
The center of the test backup network was the database server, a Sun Microsystems Inc. Enterprise 6800 with 24 750MHz CPUs and 25GB of RAM. The server was running Solaris 8 and Oracle9i (188.8.131.52) in 64-bit mode.
During tests, CPU utilization never exceeded 79 percent, an important performance metric because we wanted to make sure the server was not a bottleneck for this test. We chose this server setup not only because the Sun-Oracle combination is a popular choice among enterprise IT managers but also because we felt confident that it could take the intense I/O strain we were placing on it.
Whatever the server used, it's key that the hardware have the bus and network bandwidth to get data to tape drives fast enough to meet backup window requirements.
The Enterprise 6800 had to forward database data to our tape libraries at a rate of roughly 550MB per second, not to mention receive data from Fibre Channel-attached disks at the same data rate. Any server-induced lag would have prevented the tape drives from writing data at optimal rates, significantly reducing performance.
The lesson here: Analyze the entire data path when designing backup systems. You'll be wasting money on high-end tape and disk hardware if the systems and network between them are slow.
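A data path analysis of this kind can be sketched as a simple comparison: work out the rate the backup window demands, then find the slowest stage in the path. The code below is a minimal illustration, not the method used in the test. The stage capacities are assumptions pieced together from figures quoted elsewhere in this article (140MB per second per HBA, eight tape-side and 23 disk-side HBAs, 24 drives at 20MB per second native with roughly 1.8-to-1 compression), and the function names are hypothetical.

```python
def required_rate_mb_s(data_tb, window_hours):
    """MB/s needed to move data_tb terabytes (decimal units) within the window."""
    return data_tb * 1_000_000 / (window_hours * 3600)


def bottleneck(stages):
    """Return (name, capacity) of the slowest stage in a {name: MB/s} mapping."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Assumed capacities, expressed in MB/s of uncompressed data.
stages = {
    "disk-side HBAs": 23 * 140.0,
    "tape-side HBAs": 8 * 140.0,
    "tape drives": 24 * 20.0 * 1.8,
}

if __name__ == "__main__":
    need = required_rate_mb_s(2, 1)
    name, capacity = bottleneck(stages)
    print(f"required: {need:.0f} MB/s")
    print(f"slowest stage: {name} at {capacity:.0f} MB/s")
    print("window is achievable" if capacity >= need else "window will be missed")
```

With these inputs the requirement works out to roughly 556MB per second, in line with the 550MB-per-second figure above. A real analysis would also need to include bus, switch and CPU limits, which this sketch leaves out.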
For primary storage, we used two Storage Technology Corp. StorageTek 9176 RAID units, each containing 80 Fibre Channel hard drives (36GB per drive, spinning at 15,000 rpm). To achieve maximum throughput, we configured the units as RAID 0+1. This combines mirroring, for redundancy, with RAID 0 striping, which provides the fastest possible read and write times.
For many budget-conscious organizations, RAID 0+1 for the entire storage system is out of the question because it requires far more storage hardware than RAID formats such as RAID 5. However, using RAID 0+1 in certain speed-sensitive parts of a larger system makes sense. For example, storing database data on a RAID 5 array but storing the log on RAID 0+1 is a good compromise between cost and speed.
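The capacity cost behind that trade-off is easy to estimate. The sketch below compares usable space for the same set of drives under RAID 0+1 and RAID 5. It uses the 80-drive, 36GB-per-drive figures from our RAID units; the RAID 5 group size of eight drives is purely an assumption for illustration, and the function names are hypothetical.

```python
def usable_raid01_gb(drives, drive_gb):
    """RAID 0+1 mirrors everything, so half the raw capacity is usable."""
    return drives * drive_gb / 2


def usable_raid5_gb(drives, drive_gb, group_size):
    """RAID 5 gives up one drive's worth of capacity per group to parity.

    Drives that do not fill a complete group are ignored.
    """
    groups = drives // group_size
    return groups * (group_size - 1) * drive_gb


if __name__ == "__main__":
    print("RAID 0+1:", usable_raid01_gb(80, 36), "GB usable")
    print("RAID 5  :", usable_raid5_gb(80, 36, group_size=8), "GB usable")
```

Under these assumptions, 80 drives yield 1,440GB in RAID 0+1 versus 2,520GB in RAID 5, so matching a given usable capacity takes noticeably more drives with RAID 0+1.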
We used two StorageTek L700e tape libraries, each equipped with 12 9840B tape drives. We chose the 9840B tape drives primarily for speed: Their native uncompressed data rate of 20MB per second is best in class. The 9840B tape drives can also load a tape in roughly 5 seconds. In contrast, Super DLT (digital linear tape) drives take more than 1 minute to load, and the process of reloading 24 drives with that media would have made it difficult to meet our 1-hour backup requirement.
The biggest drawback to using the 9840B drives is that their native capacity of 20GB per cartridge is vastly inferior to that of tape drive formats such as LTO (linear tape open), which has a capacity of 100GB per cartridge. However, because the 9840B tape drives are faster and load more quickly, we figured they would perform better in our tests, even with the tape reload requirement.
Organizations with larger backup windows or less data to back up will likely find LTO gear a more suitable option.
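The capacity-versus-load-time trade-off can be roughed out with a few lines of arithmetic. The sketch below estimates how many cartridges each drive must load for a given data set and how much time the loads themselves consume. It assumes decimal units (2 terabytes as 2,000GB), a 1.8-to-1 compression ratio, cartridges spread evenly across drives, and no allowance for rewind or unload time; the function names are hypothetical.

```python
import math


def loads_per_drive(data_gb, cartridge_gb, compression_ratio, drives):
    """Cartridges each drive must load, assuming an even spread across drives."""
    cartridges = math.ceil(data_gb / (cartridge_gb * compression_ratio))
    return math.ceil(cartridges / drives)


def load_overhead_s(loads, load_time_s):
    """Seconds a drive spends loading tapes rather than writing."""
    return loads * load_time_s


if __name__ == "__main__":
    small = loads_per_drive(2000, 20, 1.8, 24)    # 20GB cartridges (9840B)
    large = loads_per_drive(2000, 100, 1.8, 24)   # 100GB cartridges (LTO)
    print("20GB cartridges :", small, "loads per drive,",
          load_overhead_s(small, 5), "s at 5 s per load")
    print("100GB cartridges:", large, "load per drive")
    print("Same 3 loads at 60 s each:", load_overhead_s(3, 60), "s")
```

On these assumptions, each 9840B drive loads three cartridges and spends about 15 seconds doing so, while a drive needing a minute per load would spend at least 3 minutes on the same three loads. Larger cartridges cut the number of loads, which is where higher-capacity formats recover some ground.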
All the storage components used in this test were linked with two 3200 and four 3800 2G-bps Fibre Channel switches from Brocade Communications Systems Inc. The Enterprise 6800 server was configured with 31 Emulex Corp. LightPulse 2G-bps Fibre Channel HBAs (host bus adapters). Based on previous performance tests, we felt that each HBA could dependably deliver 140MB per second.
The I/O boards in the Enterprise 6800 had eight 66MHz PCI slots, and we dedicated all these slots to the HBAs connecting our server to the tape libraries. We placed the rest of the HBAs in the remaining 33MHz slots, and these adapters connected the RAIDs to our server.
To make sure none of these HBAs was overloaded, we hooked only seven drives to each adapter. Our Fibre Channel SAN was set up with zoning to isolate traffic to each of the HBAs. Our primary concern for this test was performance, so we did not set up redundant pathing for the Fibre Channel links.
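One way to sanity-check a drives-per-adapter figure is to divide the bandwidth an HBA can dependably deliver by the rate each attached drive is expected to sustain. The sketch below does this with the 140MB-per-second HBA estimate and, as an assumption, the 20MB-per-second native rate quoted for the 9840B. It is offered as an illustration only; it is not a description of how the limit of seven was actually derived, and the function name is hypothetical.

```python
def max_drives_per_hba(hba_mb_s, drive_mb_s):
    """Largest number of drives whose combined rate fits within one HBA."""
    return int(hba_mb_s // drive_mb_s)


if __name__ == "__main__":
    print(max_drives_per_hba(140.0, 20.0), "drives per HBA at 20 MB/s each")
    # With 1.8-to-1 compression, each drive may accept about 36 MB/s instead.
    print(max_drives_per_hba(140.0, 36.0), "drives per HBA at 36 MB/s each")
```

If the drives accept compressible data faster than their native rate, the safe number per adapter drops accordingly, so the per-drive rate used here should reflect the data actually being backed up.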
Fibre Channel zoning is similar to virtual LANs in the IP world. Here, we created zones within the Fibre Channel switch that focused traffic to the right places. For example, an HBA was allowed to see only the tape drives and disk drives to which it was given access.
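Conceptually, a zone is just a named group of ports, and two devices can see each other only if they share at least one zone. The sketch below models that rule. It is a toy model rather than switch configuration syntax, and every zone and device name in it is hypothetical.

```python
# Hypothetical zone layout: one zone per HBA, containing only its own targets.
zones = {
    "zone_tape_hba0": {"hba0", "tape0", "tape1", "tape2"},
    "zone_tape_hba1": {"hba1", "tape3", "tape4", "tape5"},
    "zone_disk_hba8": {"hba8", "raid_a_port0"},
}


def can_see(zones, initiator, target):
    """True if initiator and target are members of at least one common zone."""
    return any(initiator in members and target in members
               for members in zones.values())


if __name__ == "__main__":
    print(can_see(zones, "hba0", "tape1"))         # same zone
    print(can_see(zones, "hba0", "tape3"))         # tape belongs to hba1's zone
    print(can_see(zones, "hba0", "raid_a_port0"))  # disk zone is separate
```

Keeping each HBA in its own zone, as modeled here, means backup traffic for one adapter never competes with devices that belong to another.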
The key takeaway here is that fast backups depend on the right choice of components and balancing hardware resources to ensure that tape drives are writing data to tape as fast as physically possible.
In fact, we did very little software tuning for this test. For backup software, we used Veritas NetBackup 4.5 DataCenter Master Server to manage the system and Veritas Database Edition for Oracle on Solaris to back up the database server. The Oracle database automatically took care of most of the settings, and it set up 24 channels that distributed the backup load equally to the 24 tape drives.
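To show what distributing a backup load evenly across channels involves, the sketch below uses a simple greedy heuristic: sort the files from largest to smallest and always hand the next file to the least-loaded channel. This is an illustrative stand-in only; it is not how Oracle or NetBackup actually allocate work, and the file names and sizes are made up.

```python
import heapq


def assign_to_channels(file_sizes_gb, channels):
    """Greedy balancing: give the next-largest file to the least-loaded channel.

    Returns {channel_index: [file names]}.
    """
    heap = [(0.0, ch) for ch in range(channels)]  # (current load, channel)
    heapq.heapify(heap)
    assignment = {ch: [] for ch in range(channels)}
    by_size = sorted(file_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        load, ch = heapq.heappop(heap)
        assignment[ch].append(name)
        heapq.heappush(heap, (load + size, ch))
    return assignment


if __name__ == "__main__":
    # Hypothetical datafiles and sizes in GB.
    files = {"users01": 10, "index01": 8, "data01": 5, "data02": 3}
    for ch, names in assign_to_channels(files, 2).items():
        print(f"channel {ch}: {names}")
```

In this small example both channels end up with 13GB of work. The point of any such scheme is the same as in the test: no tape drive should sit idle while another still has a long queue.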
A story on backup performance would not be complete without details on the restoration process. No matter how fast, how big or how much is spent on a backup implementation, none of it matters if you can't restore the data accurately and quickly.
Restore operations generally take more time to complete than backup operations because there are significant differences in hardware workloads. It took us 1 hour to back up 2 terabytes of data and 2 hours and 27 minutes to restore it.
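Those two times translate directly into average throughput. The sketch below does the conversion, assuming decimal units (1 terabyte as 1,000,000MB) and treating the backup as taking a full 60 minutes; the function name is hypothetical.

```python
def rate_mb_s(data_tb, minutes):
    """Average MB/s when moving data_tb terabytes (decimal) in the given minutes."""
    return data_tb * 1_000_000 / (minutes * 60)


if __name__ == "__main__":
    backup = rate_mb_s(2, 60)          # 1 hour
    restore = rate_mb_s(2, 2 * 60 + 27)  # 2 hours 27 minutes
    print(f"backup : {backup:.0f} MB/s")
    print(f"restore: {restore:.0f} MB/s")
    print(f"restore took {backup / restore:.2f}x as long")
```

By this estimate, the backup averaged roughly 556MB per second and the restore roughly 227MB per second, so the restore ran at well under half the backup rate.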
During the backup process, we had the benefit of two RAID units that could send data to the tape drives at a rapid rate. If one RAID unit was tied up, the tape drives could get data from its mirrored unit. This allowed the tape drives to work at their optimal levels, and it eliminated a fair amount of latency.
When we shifted over to the restore part of the test, the RAID units were constantly writing data, not reading it. Writing is a much more expensive operation on mirrored storage because every write has to be done twice. (Reads need to be done only once.)
While restoring data, the tape drives were only able to write to a single RAID unit, which in turn was responsible for bringing its mirrored pair back up-to-date after receiving the restored data from the tape drives.
We probably could have accelerated the restore process by splitting the restore jobs evenly between the RAID units. However, we wouldn't recommend this as a general practice because it would have resulted in two half-populated RAID units. In an emergency situation, it's far better to have one good RAID unit online and ready for transactions than two half-built RAID units waiting to synchronize.
Senior Analyst Henry Baltazar can be reached at firstname.lastname@example.org.