Supercomputers for the Masses?
Ten years ago, supercomputers were multimillion-dollar systems typically used for massive projects, such as modeling Earth's climate or nuclear reactions. Today, they are called HPCCs, or high-performance computing clusters, and they can be built inexpensively from off-the-shelf PCs. More important, they are quickly becoming suitable for mainstream enterprise computing.
HPCCs look completely different from traditional supercomputers: They are fan-cooled, not water-cooled, and they sit in racks and use off-the-shelf components. And while the inventor of supercomputers—Cray Research Inc.—may have cranked out only two or three computers a year a decade ago, companies including Dell Computer Corp., Red Hat Inc. and Microsoft Corp. are now building hundreds of postmodern supercomputers at a time.
The changes in supercomputing can be seen most clearly in academia, where the New Age supercomputers are commonly used.
eWEEK Labs recently visited Stanford University, in Stanford, Calif., which was setting up a 300-node cluster comprising Dell systems running Red Hat Linux. The goal is to use the cluster at Stanford's Bio-X, a massive, state-of-the-art facility funded predominantly by Jim Clark of Silicon Graphics Inc. fame. The role of Bio-X is to bring together the different sciences (including engineering, physics, medicine and biology) so researchers can better share resources, planning and data.
Stanford began building its cluster late last year, with help from Dell and Intel Corp. By last month, it was tuning the cluster to compete for a spot on the Top 500 Supercomputers list, a directory of the most powerful supercomputers in the world published biannually by the University of Mannheim, in Germany; the University of Tennessee; and the National Energy Research Scientific Computing Center.
Ironically, just a few years ago, the Top 500 list comprised mostly SGI systems based on Cray technology. Now, there is just a sprinkling of Crays in the mix.
The fastest Cray on the list is at No. 39, clocking in at 1,166 gigaflops—nearly a thousand times faster than a Cray Y-MP circa 1988. Interestingly for the enterprise, the performance of the No. 39 Cray system, which is used by the government for unknown but probably defense-related modeling, is dwarfed by systems running Red Hat Linux that are far less expensive to build and operate. The fastest Linux cluster, run by the Lawrence Livermore National Laboratory, clocks in at nearly 6,000 gigaflops.
Stanford's original goal was to place in the first 70 of the Top 500. However, after the system was built, Steve Jones, architect of the Stanford clustering project, said the best he hoped for was a spot in the first 200 in the benchmark. (The numbers are still being crunched, but eWEEK estimates that the Stanford system will come in at about 170.)
Although low-cost computers can be used in a cluster, the network switching fabric has a significant impact on performance. Because of cost concerns, Jones was forced to use a 100BaseT (Fast Ethernet) network backbone instead of the far-faster Gigabit Ethernet fabric. "The switching fabric has a huge impact on our placement for the Top 500 list," said Jones. "Due to costs, we sacrificed network speed in the beginning. Replacing the switching fabric will put us where we should be on the list."
The fastest high-performance clustering interconnects are devices such as Myrinet, made by Myricom Inc. However, these interconnects are usually expensive: about $1,000 per node (or $300,000 for a 300-node HPCC such as Stanford's), plus $100,000 more for the Myrinet switch. This is too pricey for most academic concerns, but if Stanford had gone with Myrinet, it could easily have jumped up more than 100 places on the list. As it stands, Jones said he will most likely upgrade the switching fabric to Gigabit Ethernet by the fall and run the benchmark again in November.
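To make the Myrinet price tag concrete, the figures quoted above can be added up for a cluster the size of Stanford's. This is only a back-of-the-envelope sketch: the constant and function names are illustrative, and it assumes one interconnect per node and a single switch.

```python
# Figures quoted in the article; 300 nodes is the size of Stanford's cluster.
NODES = 300
ADAPTER_COST_USD = 1_000    # per-node Myrinet interconnect ("about $1,000")
SWITCH_COST_USD = 100_000   # Myrinet switch

def interconnect_cost(nodes, adapter_cost, switch_cost):
    """Total fabric cost, assuming one adapter per node plus one switch."""
    return nodes * adapter_cost + switch_cost

total = interconnect_cost(NODES, ADAPTER_COST_USD, SWITCH_COST_USD)
print(f"Adapters: ${NODES * ADAPTER_COST_USD:,}")
print(f"Total:    ${total:,}")
print(f"Per node: ${total / NODES:,.0f}")
```

On these assumptions the fabric alone comes to $400,000, or roughly $1,333 per node once the switch is spread across the cluster.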
So why did Jones and Stanford even want to participate in the benchmark test, knowing that the eventual upgrade to a new switching fabric would change its position so dramatically? Jones said running the benchmark helped tune the cluster, providing performance gains that are already benefiting Stanford scientists and researchers.
Clusters make only marginal sense in the enterprise right now; they are useful only in specific cases. Reza Rooholamini, director of engineering for operating systems and clustering in Dell's product group and head of Dell's clustering group, said HPCCs are gradually moving out of academia and into the enterprise and that there are three main commercial areas of interest right now: oil exploration; bioinformatics; and the automotive industry, for use in crash test simulations. "The applications are typically technical applications," said Rooholamini, "but the organizations that use them are commercial, money-making businesses."
Rooholamini, based in Round Rock, Texas, pointed to some problems with running enterprise applications. For one thing, it is difficult to run database applications, the core of enterprise computing, on a cluster because of issues related to distributed queries. IBM and Oracle Corp. have cluster-capable databases (IBM's DB2 and Oracle's Oracle9i RAC), but it is not an easy task to set them up, said Rooholamini. And, generally speaking, it is nearly impossible to take existing applications and recompile them as message-passing applications that take advantage of computing clusters.
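A small sketch may help show why recompiling alone is not enough: a message-passing program has to be structured around splitting the work, sending pieces to workers and collecting their replies. The example below is an illustration only. It uses threads and queues inside a single process as a stand-in for cluster nodes and their network, and the names worker and cluster_sum are hypothetical; real HPCC applications would typically rely on a message-passing library running across machines.

```python
import queue
import threading

def worker(inbox, outbox):
    """Stand-in for a compute node: receive a chunk, reply with its partial sum."""
    while True:
        chunk = inbox.get()
        if chunk is None:          # sentinel message: no more work
            break
        outbox.put(sum(chunk))

def cluster_sum(data, n_workers=4):
    """Sum a list by splitting it into chunks and exchanging messages with workers."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()

    # Partition the data: one chunk per worker.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    for chunk in chunks:
        inbox.put(chunk)
    for _ in threads:
        inbox.put(None)            # tell every worker to stop

    # Gather one reply per chunk and combine the partial results.
    total = sum(outbox.get() for _ in chunks)
    for t in threads:
        t.join()
    return total

print(cluster_sum(list(range(1, 101))))
```

Even in this toy case, the partitioning, messaging and gathering steps do not exist in an ordinary serial program, which is why existing applications generally have to be redesigned rather than simply rebuilt.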
It may be easier to use the so-called grid engines to distribute application workloads. Microsoft, for one, is working to combine the grid and clustering technologies. Greg Rankich, product manager, Windows Server Product Management Group, in Redmond, Wash., said Microsoft is focusing at least some of its efforts on distributing business solutions—in part because of customer demand for consolidation, but also to take advantage of spare CPU cycles.
Rooholamini agreed that grid and HPCC technologies are merging. "We look at grid [computing] as an evolution of HPCCs," said Rooholamini. "From a technology perspective, if I design my HPCC and distribute my compute nodes across a larger geography, then I am solving and incorporating things that are necessary for a grid. If I can eliminate latencies in the grid, then I have tackled some of the obstacles."
However, Rooholamini estimates it will be another three years before we see grid-aware applications.
Labs Director John Taschek can be reached at firstname.lastname@example.org.