For the National Aeronautics and Space Administration, efforts to obtain sufficient computing power have long been a moving target. NASA is, of course, best known as the government agency involved in space exploration. But the agency also handles a multitude of research projects related to aerospace engineering, nanotechnology, DNA sequencing and weather patterns here on Earth.
NASA has long been a leading customer of Silicon Graphics Inc., of Mountain View, Calif., and other supercomputer manufacturers. However, the agency never seemed to have enough computing power or storage space for all the hundreds of staff scientists engaged in projects ranging from long-term weather forecasting to examining the environmental effects of the space shuttle.
In 2004, in a bold move to reverse that long-standing shortage, NASA placed a groundbreaking order with SGI for a system that would boost the agency's total computing power tenfold. Under the effort, dubbed Project Columbia, NASA sought the world's most powerful supercomputer, and it wanted it quickly. (The resulting supercomputer is currently ranked as the world's third-largest.)
“We needed ways to improve our high-end computing power,” said Bill Thigpen, Columbia project manager for NASA. “It had gotten to the point that when we had an emergency research project, we would not have the computing capacity to handle it. We would have to shift all of our capacity from other projects and throttle back a lot of ongoing studies.”
In July 2004, SGI and NASA officially launched Project Columbia, which was to consist of 20 separate systems, each containing 512 processors. The challenge: While bigger is often better, especially for the kind of advanced aerospace research NASA conducts, building bigger is not as simple as just adding more processors.
“The larger you make a system, the more potential it has for breakdown,” said Jill Matzke, SGI's high-performance computing marketing manager. Matzke said she was especially concerned about the performance slippage she had observed on other massive machines.
Meeting NASA's tight deadline to have the machine delivered in less than a year required round-the-clock labor on the SGI manufacturing floor, coupled with strong collaboration among NASA; SGI; Intel Corp., of Santa Clara, Calif., which supplied more than 10,000 Itanium processors; and Computer Sciences Corp., of El Segundo, Calif., which, as a subcontractor to SGI, integrated large parts of the machine.
“We took a different approach to this procurement,” said Thigpen. “NASA and Intel and SGI worked together as a team.”
From a design standpoint, SGI addressed the major performance concerns by configuring its Altix systems as 20 nodes, each running a single operating system across 512 processors, rather than as one enormous system of more than 10,000 independent processors working in a single unified effort. It is a structure that dramatically simplifies supercomputing.
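The practical benefit of that single-system-image approach is that, within one 512-processor node, every processor sees the same operating system and the same shared memory, so parallel work can be expressed with shared-memory constructs rather than explicit message passing between separate machines. The following is a minimal illustrative sketch, not NASA or SGI code, using OpenMP in C; the array size and loop body are arbitrary placeholders chosen only to show the idea.

```c
/* Minimal sketch (illustrative only, not NASA/SGI code): under a single
   system image, one OS instance exposes all processors and one shared
   memory space, so a single directive can spread a loop across them
   without any message passing between separate operating systems. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 10000000;                  /* placeholder problem size */
    double *data = malloc(n * sizeof *data);  /* one array, visible to every thread */
    if (!data) return 1;

    double sum = 0.0;
    /* Spread the loop across however many CPUs the single OS image exposes. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        data[i] = 0.5 * (double)i;            /* each thread updates its slice in place */
        sum += data[i];
    }

    printf("threads available: %d  checksum: %g\n", omp_get_max_threads(), sum);
    free(data);
    return 0;
}
```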
The massive project put the SGI production floor on a 24/7 schedule. As workers at SGI completed assembly of individual nodes, they shipped them to CSC, which worked on-site at NASA's Ames Research Center at Moffett Field, also in Mountain View, to assemble the parts into a single supercomputer occupying approximately 10,000 square feet.
Like the team at SGI, workers at CSC logged a lot of overtime to keep the project on track. Alan Powers, high-end computer lead at Computer Sciences, credits a highly motivated and coordinated team for meeting a schedule in which new parts of the machine would be up and running within three to five days of delivery.
When the computer was completed four months later, it came in not only ahead of deadline but also under budget. Powers estimated that the amount of money spent on the Columbia project was about one-eighth the amount spent to build some of the world's other major supercomputers, such as the Earth Simulator in Japan.
Along with Columbia's tight time frame, Powers credits SGI's use of commodity memory systems, Intel chips and a Linux operating system with keeping costs down relative to earlier SGI systems built exclusively with SGI technology. Nonetheless, he said certain proprietary SGI technology, such as its NUMAlink Interconnect Fabric for supercomputers, was instrumental in boosting Columbia's efficiency.
“It is hard to imagine the work being done by NASA today versus five years ago,” said Powers, who noted that Columbia has cut the run time of some simulations, such as worldwide climate modeling, from several months to a week or two.
For the moment, Thigpen said, NASA is delighted with the vast increase in power it has realized.
“This is an improvement of a factor of about 10 over our previous computing power, and it was achieved in four months,” said Thigpen. “I've gotten multiple letters from scientists who say that, with Columbia, they are doing things that couldn't be done before. For instance, in climate modeling, there is a huge difference between being able to look at a half of a degree and an eighth of a degree. Columbia provides for much more precise modeling.”
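To put Thigpen's resolution example in perspective: halving the grid spacing of a global model quadruples the number of horizontal grid cells, so moving from a half-degree to an eighth-of-a-degree grid multiplies the cell count by 16, before accounting for the shorter time steps a finer grid typically requires. The short C sketch below is a back-of-the-envelope illustration of that ratio, not a NASA calculation; it simply counts grid cells at the two resolutions Thigpen mentioned.

```c
/* Back-of-the-envelope sketch (illustrative arithmetic, not NASA figures):
   count horizontal grid cells on a regular latitude-longitude grid at
   half-degree and eighth-degree spacing and print the ratio. */
#include <stdio.h>

static long long cells(double degrees) {
    long long lon = (long long)(360.0 / degrees);  /* cells around a latitude circle */
    long long lat = (long long)(180.0 / degrees);  /* cells from pole to pole */
    return lon * lat;
}

int main(void) {
    long long coarse = cells(0.5);    /* half-degree grid:   720 x 360  =   259,200 cells */
    long long fine   = cells(0.125);  /* eighth-degree grid: 2880 x 1440 = 4,147,200 cells */
    printf("0.5-degree cells:   %lld\n", coarse);
    printf("0.125-degree cells: %lld\n", fine);
    printf("ratio:              %lldx\n", fine / coarse);  /* 16x more cells */
    return 0;
}
```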
Andrea Orr is a San Francisco-based freelance writer. She can be reached at andrea_orr@sbcglobal.net.