Storage Guru: Q&A With Garth Gibson

The man who brought you RAID is working to help storage keep pace with high-performance computing demands.

Garth Gibson, one of the most respected thought leaders in the field of data storage, is chief technology officer at Panasas, in Fremont, Calif., and an associate professor of computer science and electrical and computer engineering at Carnegie Mellon University. Gibson received a doctorate in computer science from the University of California at Berkeley in 1991. While at Berkeley, he did the groundwork research and co-wrote the seminal paper on RAID, now a checklist feature for storage products.

Panasas was founded in 1999 and employs 125. It has supplied a cutting-edge, clustered storage system for a new supercomputer called Roadrunner. Currently being built by IBM for the Los Alamos National Laboratory in Los Alamos, N.M., Roadrunner has the potential to achieve a never-before-sustained speed of 1,000 trillion calculations—or 1 petaflop—per second.

It will take extraordinary storage technology to back up and supply data for a system of this kind. The secret sauce behind this major storage advancement is Panasas PanFS, a parallel file system co-developed by Gibson and one that he hopes will evolve into an industry standard, called pNFS (parallel Network File System).

Most storage systems use a single head controller; pNFS eliminates the single head controller and essentially provides unlimited "controllers". (To put that in perspective, imagine replacing a 2-lane divided highway with a 10-lane freeway.) This parallel file system effectively increases the speed of data flow by over five times, potentially eliminating the single most irritating bottleneck that has plagued most supercomputing models in the past.

After joining the Carnegie Mellon faculty in 1991, Gibson founded the Parallel Data Laboratory, one of the premier academic storage system research labs. He also founded the Network-Attached Storage Device working group of the National Storage Industry Consortium and led storage systems research in the Data Storage Systems Center, one of the largest academic magnetic storage technology research labs. Gibson participates in a variety of academic and industry professional organizations.

Gibson recently spoke with Senior Writer Chris Preimesberger about the future of storage computing and how Los Alamos-level storage technology will trickle down to more mainstream enterprise applications.

How significant is the Panasas deployment at Los Alamos in the overall data storage picture? What does it portend for the rest of enterprise computing?

Government high-performance computing led the development and commoditization of Linux clusters, which are rapidly changing the way "profit-generating" IT is done.

Government HPC has been led by the mission-driven Department of Energy labs, of which Los Alamos is one of the foremost. In this sense, technologies that Los Alamos incubates are highly likely to shape future profit center IT. Internet services, energy exploration, hedge fund modeling, weather modeling, language translation and the design of silicon chips, airplanes and cars are all examples of profit center IT that are following the examples set by leading labs like Los Alamos.

Los Alamos standardization around Panasas storage for all their HPC clusters is a predictor of the storage capabilities that profit-generating IT will need in commercial data centers very soon. Then, as profit-generating IT proves out a cost-effective, advanced technology, the more conservative cost-center IT can begin to capitalize on the advantages.

I have been told that pNFS is fine once it's up and running, but that it is extremely complicated to design, administer and deploy. Can an average storage administrator handle it?

Panasas storage comes out of the box, assembles in a few minutes, and powers up and is adopted into an existing system in minutes. An administrator needs do little more than decide how to allocate the new storage, and users are enabled to create and access huge files and file sets with high performance. It is not complicated to administer or deploy.

When our customers have trouble it is almost always because they have never had high-performance, NAS [network-attached storage]-like storage, and so they are surprised that tens of gigabytes per second might require more than the ancient, cheap Ethernet switch that 10-megabit-per-second e-mail and Internet are using.

What is funny about this is that traditional disk-array storage uses extremely complex and expensive SAN [storage area network] switching environments. It is very specialized technology that companies like EMC make lots of money selling storage management software to tame.

SAN customers have learned that they need to buy all that software, and hire specialized SAN managers for their data centers, just to get small fractions of the performance that Panasas can get out of the better-quality Gigabit Ethernet LAN switching technology that is understood by Internet networking staff in all data centers.

Is pNFS something that could eventually trickle down into enterprise computing?

There is PanFS, and there is pNFS. PanFS is the name of the global namespace that all Linux compute nodes connected to Panasas storage can see and share. pNFS is a short form for Parallel NFS, a new generation of the traditional NFS protocol, properly called NFS Version 4.1, that is being specified in the IETF [Internet Engineering Task Force] Internet standards body and prototyped by companies like Panasas, NetApp, EMC, IBM, HP, Sun and others.

I proposed pNFS to the NFS Version 4 standards group in late 2003 [when PanFS first went into production] as a way for high-performance file-system technology to enter into the mainstream and for mainstream NFS technology to be able to provide interoperable solutions for high-performance Linux clusters.

The pNFS specification is scheduled to be finished this year, and product announcements are expected not long after that.

But this is about Los Alamos and Roadrunner, so you'd be right to ask for the connection. In fact, the core ideas came from a conversation with the Los Alamos man in the incubation of storage technology, Gary Grider. He asked how his investment in technology development could be made persistent; that is, how could he be confident that the solutions he fosters remain available to customers like Los Alamos regardless of the future product directions of any one company like Panasas?

My answer then was through an industry-managed, interoperable and competitive standard protocol, and the pNFS proposal was germinated.

Along the way, Los Alamos and their national lab friends have been at the right place at the right time, with a university research grant, for example, to continue the advancement of pNFS. The bulk of the development has been and is being carried by the prototyping companies because the business case is compelling and because leaders like Panasas and Los Alamos proposed fair, open and evenhanded standards development processes.

Is there anything else about the pNFS clustered storage system that you see as important for us to know about that hasn't been discussed here?

Yes. Reliability and integrity. I was one of three authors of the original RAID paper in 1988. By 1997, RAID had asserted its dominance in the disk array marketplace and became the gold standard for reliable, high-integrity storage. But today there is less confidence around solutions based on traditional RAID than there has been in at least a decade, even though the low-reliability disk drives of today [the often-scorned Serial ATA desktop disks] are 10 times more reliable than the enterprise drives of a decade ago.

And storage vendors everywhere are inventing new marketing names, like RAID DP, N+3 and others, for what RAID researchers called RAID 6 15 years ago. What is really happening is that the storage capacity of disks has been getting bigger at a rate that, on average, about matches Moore's Law for the rate of increase of transistors on computer chips. That is, today's disks have more than 200 times as much storage capacity. What this means is that the time to rebuild a failed disk is about 200 times longer, so the window of vulnerability to secondary failures is 200 times larger.

The probability of the disk failing to read back data is the same as it was long ago, so today you can expect at least one failed read every 10TB to 100TB. But the reconstruction of a failed 500GB disk in an 11-disk array has to read 5TB, so there can be an unacceptably large chance of failure to rebuild every one of the 1 billion sectors on the failed disk.
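The arithmetic behind that claim can be sketched roughly as follows. This is a back-of-the-envelope illustration, not a vendor's reliability model: it assumes failed reads arrive independently at the quoted average rate (a Poisson approximation that is our assumption, not Gibson's), and the function name and parameters are hypothetical.

```python
import math

def rebuild_failure_probability(surviving_disks, disk_tb, tb_per_read_error):
    """Chance of at least one failed read while rebuilding one disk.

    Assumes read errors are independent and occur on average once per
    `tb_per_read_error` terabytes read (an illustrative Poisson model).
    """
    tb_to_read = surviving_disks * disk_tb           # 10 x 0.5TB = 5TB
    expected_errors = tb_to_read / tb_per_read_error
    return 1.0 - math.exp(-expected_errors)

# 11-disk array of 500GB disks: one has failed, ten must be read in full.
for tb_per_error in (10, 100):
    p = rebuild_failure_probability(10, 0.5, tb_per_error)
    print(f"one failed read per {tb_per_error}TB: {p:.1%} chance of a rebuild error")
```

Under these assumptions the chance of hitting at least one failed read during the rebuild falls between roughly 5 percent and 39 percent, depending on which end of the 10TB-to-100TB range applies.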

Because today's disk array and file system technology cannot automatically cope with the loss of even one sector in 1 billion, these rare disk-read errors become loss of entire volumes (terabytes) of data. So vendors are attempting to cope with all of these problems by employing more expensive [in capacity and performance] RAID error-correcting codes. And they are adding on-the-fly testing of checksum codes to notice silent disk errors, which are much more rare but possible.
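The idea of checking a checksum on every read can be illustrated with a minimal sketch. It uses a CRC-32 from Python's standard library purely as an example; which codes real products use is not stated here, and the `store_block` and `read_block` helpers are hypothetical names, not any vendor's API.

```python
import zlib

def store_block(data: bytes):
    """Return the block together with a checksum computed at write time."""
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_checksum: int) -> bytes:
    """Recompute the checksum on read and refuse to return corrupted data."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch: silent corruption detected")
    return data

block, checksum = store_block(b"sector contents")
read_block(block, checksum)                 # clean read passes

corrupted = b"sector c0ntents"              # one byte silently changed
try:
    read_block(corrupted, checksum)
except IOError as err:
    print(err)
```

Without the stored checksum, the corrupted block would be returned to the application as if it were good data; with it, the error is at least noticed and can be handed to a RAID code for repair.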

Of course the marketing hype is excessive. I started working on multiple failure correcting code for disk arrays in 1989, and on disk-array technology for reducing rebuild times [called parity declustering] a few years later. Bringing these technologies into products is long overdue, but finally happening across the industry.

When something really bad happens (disk read errors during disk failure rebuilds and maybe a network error thrown in for sport), Panasas does not toss away terabytes of data just because a tiny amount of data is unreachable. Instead Panasas automatically fences off the file containing problematic data and makes the rest of the terabytes of data available to applications and users without interruption.