Data at the Ready

Clustering software helps center untangle DNA mysteries.

The DNA sequence of the entire human genome was mapped as of April, but the battle to keep that massive amount of data available and protected rages on at places such as The Genome Sequencing Center at Washington University Medical School.

Established in 1993 with a grant from the National Human Genome Research Institute at the National Institutes of Health, The Genome Sequencing Center contributed about 25 percent of the completed human genome information. In the course of this work, the GSC experienced what all major scientific research efforts and most commercial entities are going through these days: skyrocketing data proliferation. The GSC's data store grew from somewhere in the gigabyte range to some 8 terabytes over the past few years.

High availability of data is crucial to ongoing work, according to officials at the center, and data loss is unacceptable: the research costs and the investments in computational resources that go into mapping DNA are simply too high.

To put that into dollar terms, according to Kelly Carpenter, senior technical manager at the center, in St. Louis, every piece of DNA mapped translates to about $200,000 in initial technology investment and subsequent upkeep. "You look at any file folder on the system, and, I figured it out, it comes out to about $200,000 for each," Carpenter said. "You lose that, you lose $200,000."

To protect those six-figure file folders, the center turned to Oracle Corp.'s Oracle9i RAC (Real Application Clusters) managed with the Database Edition of Veritas Software Corp.'s Advanced Cluster heterogeneous file system software. This cluster software is the foundation for Oracle RAC environments that run on Solaris or HP-UX. The GSC put this software at the center of a new Fibre Channel storage area network running on Solaris and Linux operating systems.

The motivation behind these technology choices, Carpenter said, was to provide a high-availability environment that would enable the research center to cut costs, beef up performance, lower management costs and stay on top of the massive data growth associated with gene mapping.

The GSC is now running two Sun Microsystems Inc. Sun Fire V880 servers in an array, each with four processors, and is in the process of migrating off an older cluster that consists of two Sun E3500 servers.

Previously, the GSC used a high-availability Oracle HA Cluster platform. In that type of parallel-server setup, one server runs the Oracle9i database while another sits idle, waiting to take over if the production version fails. Such a setup is expensive, Carpenter said, given that half the server resources involved are seldom used.

Another factor contributing to the high cost of Oracle HA is that it is difficult to set up and administer, Carpenter said.

The failover capabilities of RAC with Veritas clustering software are also much smoother than those of Oracle HA, Carpenter said. With Oracle HA, if a query dies when a database instance goes down, researchers would be "dead in the water" until another Oracle instance came back up, which could take minutes, Carpenter said. With RAC, it's "more like a bump," he said.

"For the client, it's cool," Carpenter said. "If the client is doing a query from some nice pretty GUI, if the physical server it's talking to dies, the client can figure out it died and will automatically reissue the query from the beginning on another server, without any interruption. Depending on how long the query is supposed to run, something that normally takes a few seconds to run, for example, will run a little slower. But by the time they ask, 'Is the server down?' they say, 'Oh, wait! It's done.'"
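
The behavior Carpenter describes, in which a client notices a dead server and reissues its query from the beginning on another one, can be sketched in a few lines of Python. This is only an illustration of the general idea, not Oracle's or Veritas' implementation: the names ServerDown, run_with_failover, dead_node and healthy_node are hypothetical, and each cluster node is modeled as a plain function.

```python
class ServerDown(Exception):
    """Raised by a node that has failed (hypothetical, for illustration only)."""


def run_with_failover(query, nodes):
    """Try the query on each node in turn, restarting it from scratch after a failure."""
    last_error = None
    for node in nodes:
        try:
            # The query is reissued from the beginning on this node;
            # no partial results carry over from a failed attempt.
            return node(query)
        except ServerDown as err:
            last_error = err
    raise RuntimeError("no cluster node could run the query") from last_error


def dead_node(query):
    # Stand-in for a physical server that has gone down.
    raise ServerDown("node is down")


def healthy_node(query):
    # Stand-in for a surviving server that answers the query.
    return f"rows for: {query}"


result = run_with_failover("SELECT 1", [dead_node, healthy_node])
print(result)
```

From the caller's point of view, the only visible effect of the first node's failure is the extra time spent on the second attempt, which matches the "bump" described above.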

The GSC also uses Veritas NetBackup software with FlashBackup and Shared Storage options to protect and restore some 285 million files. Since the migration to the current cluster and storage setup in June, the GSC credits the FlashBackup option with cutting backup time from 24 hours to 4 hours and with reducing its catalog from 150GB to 30GB.

Those gains are significant. But one of the main points of installing Veritas Advanced Cluster software to handle the Oracle RAC setup comes down to ease of use, Carpenter said, with its GUI that allows database administrators to easily fail over servers without having to resort to command lines to bring things down. Obviating the need to enter big chunks of commands into command-line interfaces brings total cost of ownership down by reducing ongoing management costs, Carpenter said.