Compute Farms Yield a Fine Crop of Data

Cheap compute cycles satisfy the need to number-crunch ever-bigger jobs

The human genome, contained in a set of 23 chromosomes, is estimated to contain some 3.16 billion nucleotides. The scientists who dig through that mountain of data to discern the mysteries of life demand an equally massive amount of raw, number-crunching power.

Luckily for Rainer Fuchs, the commoditization of PCs and the ubiquity of TCP/IP as a networking standard have enabled him to deliver enough processing brawn—in the form of a so-called compute farm—to keep his scientists happy at Biogen Inc., a biotechnology company in Cambridge, Mass.

Fuchs, Biogen's senior director of research informatics, is the master of a compute farm that Biogen opened earlier this year to unleash hundreds of CPUs onto gigabytes of human genome data that's been parceled out into bite-size pieces. And what a rich harvest this farm will yield: Scientists will use the computing power to sift through sections of data over days, weeks or even months to find diseases lurking in our DNA, as well as clues to new generations of biotech drugs that will fight them.

Compute farms are close relatives of server farms, which often underpin e-commerce and Web-hosting applications. But while server farms are intended to process a large number of short transactions, compute farms typically process a small number of large jobs that can be easily split into parallel processes. The basic components of a compute farm are a bank of PCs, often two- or four-way boxes running Pentium or low-end RISC processors. Biogen's farm features dual-processor Pentium PCs running Red Hat Inc.'s Red Hat Linux.
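The split-and-combine pattern that makes such jobs farm-friendly can be sketched in a few lines of Python; the `count_gc` analysis step and the toy input below are illustrative stand-ins, not Biogen's actual workload:

```python
from multiprocessing import Pool

def chunk(sequence, size):
    """Split one large input into bite-size, independent pieces."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]

def count_gc(piece):
    """A stand-in analysis step: count G/C nucleotides in one piece."""
    return sum(1 for base in piece if base in "GC")

if __name__ == "__main__":
    genome = "ATGCGCATTAGCGCGTATAGC" * 1000   # toy data, not a real genome
    pieces = chunk(genome, 500)
    with Pool() as pool:                      # one worker per available CPU
        partials = pool.map(count_gc, pieces)
    print(sum(partials))                      # combine the partial results
```

Because each piece is analyzed independently, adding CPUs shortens the run without any change to the analysis code, which is what makes this class of job "easily split into parallel processes."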

Highly specialized biotech companies such as Biogen are pioneering the development and use of compute farms. That's at least partly because, like Biogen, many have been pushed by the massive data demands of the Human Genome Project. But technology managers and analysts say these processing powerhouses aren't only for niche markets.

Indeed, any IT operation that needs to crunch large volumes of data can benefit from a farm, including banks performing financial analyses, insurers doing risk analysis and large retailers data mining for hidden customer trends. Sun Microsystems Inc., as one example, has launched a major marketing push to cultivate farm sales. A spokesman said the company has attracted a number of insurance and finance farm customers, but it hasn't yet received the go-ahead to release their names.

It's easy to see why such enterprises would be attracted to the concept: Compared with using traditional workstations or supercomputers, compute farms can reap bumper crops of savings in both time and money.

"The jobs we're doing now traditionally would have taken about two years to run on the 20-CPU workstations we had before the compute farm," Fuchs said. "It's a throughput issue. Multiprocessor workstations are great if you have complex, individual jobs to run, but they're not efficient when you're running hundreds of thousands of simultaneous jobs."

And as far as price goes, compute farms can't be beat: They cost only about one-tenth the price of a mainframe. Biogen compared the farm's price and performance with those of traditional Unix servers and multiprocessor workstations and concluded that the alternatives were many times more expensive. Fuchs declined to cite exact figures for the farm beyond saying it cost more than $500,000.

"We believe that we are getting considerably more bang for the buck with the farm," he said.

None of this is going unnoticed by vendors. The compute farm market is attracting attention from a wide range of product and service providers, including Sun, Compaq Computer Corp., IBM and Silicon Graphics Inc.

In February, Biogen brought online 150 computing nodes composed of dual-processor Pentium PCs, managed by a Sun server that acts as the central point of entry for researchers needing access to the behind-the-scenes processing power.

Software was key in getting Biogen's farm up to its potential, since it provides the crucial bridge between raw compute cycles and the parallel processing schemes that put them to work.

"Providing compute cycles is easy. The real challenge is the data distribution problem," said Michael Athanas, director of scientific computing and development for Blackstone Technology Group Inc., based in Worcester, Mass., the systems integrator that helped Biogen get its farm up and running.

With the right load balancing software, a collection of loosely coupled PCs can achieve about 98 percent of their combined theoretical efficiency, Athanas said.

One reason compute farms cost so little is that they rely on low-end hardware and software. "There's a 'roll your own' aspect to compute farms," said Michael Swenson, an analyst with market researcher International Data Corp., in Framingham, Mass. "People can just buy a bunch of PCs and then use Linux clustering software for close to free."

Another attraction: Compute farms are highly scalable. Rather than having to replace an expensive minicomputer or supercomputer over time, growing companies starved for processing cycles need only to plug in another rack of PCs.

The importance of the central server is that researchers don't have to log in to a computing node directly, an advantage because "most scientists are not Linux experts," Fuchs said. Platform Computing Corp.'s Load Sharing Facility software handles the farm's load balancing tasks.

If it all sounds like an incredibly simple setup for handling mind-bogglingly complex computation, that's an accurate take. After all, the point of compute farms is to make it easy to get big yields from massive data processing jobs.

"It's a very generic Intel-based platform," Fuchs said. "As long as [an application] runs on Linux, we don't have to parallelize the underlying code. We write wrappers around the applications, and the wrappers break the processing jobs into small pieces. The central server then just collects the results."
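The wrapper approach Fuchs describes can be sketched under one assumption: that the unmodified application reads a piece of data on standard input and writes its answer on standard output. The `run_wrapped` helper and the `wc -c` command in the usage note below are hypothetical illustrations, not Biogen's actual tooling:

```python
import subprocess

def run_wrapped(command, data, piece_size):
    """Hypothetical wrapper: break one big job into small pieces, run the
    unmodified application once per piece, and collect the outputs."""
    outputs = []
    for start in range(0, len(data), piece_size):
        piece = data[start:start + piece_size]
        # On a real farm each piece would be dispatched to a different
        # node; here every piece runs locally, one after another.
        result = subprocess.run(command, input=piece, text=True,
                                capture_output=True, check=True)
        outputs.append(result.stdout)
    # The central server's role: gather the per-piece results.
    return "".join(outputs)
```

For example, `run_wrapped(["wc", "-c"], "x" * 10, 4)` runs the stock Unix `wc -c` on three pieces of four, four and two characters, parallelizing the workload's data without touching `wc` itself.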

Talk about collecting the low-hanging fruit.