Hard Disk MTBF: Where's the Reliable Reliability Data?

Opinion: Where can we get some straight talk about hard disk reliability? It sure can't be found in the mean time between failure ratings found on spec sheets. Readers suggest vendors may know the answer but just don't want to rock the storage indu

The revelations in a couple of research papers about problems with the MTBF specification for hard disk reliability prompted readers to suggest that there must be a better way to suss out a potential problem drive in the server closet. Furthermore, they have a good idea of who may have a finger on the real-world data and why that information isn't receiving an audience.

As I mentioned in a recent column on mean time between failure, a couple of papers presented at FAST '07 (the USENIX conference on File and Storage Technologies) showed that annual disk replacement rates are much higher than predicted, that the widely held belief in a burn-in phase for the hard disk life cycle was wrong, and that the SMART (self-monitoring, analysis and reporting technology) code in hard drives and storage management software, long touted by the industry as the best predictor of disk failure, was mostly a security blanket for IT managers.

Digging through a deep bucket of responses, I found that many readers expressed some form of shrug about MTBF. Some may have believed in the rating in the past, but no longer.

"As you undoubtedly recall, MTBF was created by the drive manufacturers back when hard disks failed with much greater regularity. It was a way to reassure customers that manufacturers took reliability seriously, that drives were tested, and that by comparing these obviously inflated figures, you could assume that a drive with a million-hour rating was better than one with a half-million-hour rating," observed Barry Cohen, chief technical officer with technology analysis and consulting firm The Edison Group of New York.

"Actually believing that the drives themselves would last that long is a personal problem," he counseled.

Now, Cohen has a good point: MTBF is just a statistical measure. We're not supposed to believe it.

Still, storage managers do want to know what's going to happen to their hard disks. One side of our brain knows that each disk is just one example out of a production run in a product line that may have hundreds of thousands or even millions of units. The other side wants a date and time.


Ronald Major, manager at Sherwin Williams of Cleveland, said that, of course, all drives fail and that MTBF has to be taken in the context of a population of drives. He believes that even storage vendors seem at times not to understand what MTBF means in the context of their own products.

"I asked a storage vendor about the MTBF of their drives, and he explained that they prefer to use mean time to data loss. I suppose that's supposed to make me feel better. But what does it really mean?" Major said.

"You have a data loss on a Tier 1 array, and it sucks. I would be truly impressed with a vendor if they could tell me how many drives per year I can expect to fail, rather than how reliable their gear is. I would feel confident [then] that they know what they're talking about," he said.
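For a rough sense of what a spec-sheet MTBF would imply in the terms Major is asking for, here is a back-of-the-envelope sketch. It assumes a constant failure rate (an exponential lifetime model) and drives powered on around the clock. Those are textbook assumptions, not anything a vendor or reader supplied, and the function names and the 1,000-drive population are purely illustrative. It says nothing about real-world replacement rates, which is exactly the number readers say is missing.

```python
import math

# Assumption: drives run 24x7, so a year is 8,760 powered-on hours.
HOURS_PER_YEAR = 24 * 365


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance that a single drive fails within one year.

    Assumes a constant failure rate (exponential lifetimes), which is the
    idealized model behind a spec-sheet MTBF, not a field measurement.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours: float, drive_count: int) -> float:
    """Expected number of failed drives per year in a population."""
    return drive_count * annualized_failure_rate(mtbf_hours)


if __name__ == "__main__":
    # The two ratings Cohen mentions: half a million and a million hours.
    for mtbf in (500_000, 1_000_000):
        afr = annualized_failure_rate(mtbf)
        failures = expected_failures_per_year(mtbf, 1000)
        print(f"MTBF {mtbf:,} hours: AFR {afr:.2%}, "
              f"about {failures:.1f} failures per 1,000 drives per year")
```

Under these assumptions, a million-hour rating works out to an annual failure rate a little under 1 percent, or fewer than nine failed drives a year in a population of 1,000. The arithmetic also shows why the rating only makes sense across a population: it describes a rate, not the life span of any one disk.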

Major hopes that, with the hubbub surrounding MTBF, perhaps the storage industry will present real-world reliability for their products, rather than some data from "idealized environments."

However, other readers said that the answer may be found by examining disks in controlled environments. In fact, the storage in question could be your own. But don't expect to know any more than you do now.

Former EMC employee Steve Smith, who runs an IT management consulting business in Bellevue, Wash., said that major suppliers of RAID, NAS and SAN to the enterprise and high-performance computing sites must have sufficient statistical information about MTBF.

"The controlled environment within an enterprise-class storage array is carefully monitored and controlled. The drives in these arrays are constantly compared by the suppliers. [But] these suppliers don't share their numbers with customers," Smith said.

Why not? He said suppliers believe customers would be shocked and react poorly.

"The simple fact is the internal story [the reliability statistics] doesn't match what customers assume. After years of letting their customers believe a fantasy, the suppliers are hesitant to reframe expectations around reliability," Smith continued.

"Why should the suppliers reset their customers' expectations about MTBF?" he asked. "Unless all of them do it simultaneously, someone will lose sales revenue. None of them will take that risk. And I wouldn't recommend they reveal the numbers without an offsetting benefit."

According to Smith, customers that purchase mass quantities of arrays could change this picture by demanding the real numbers from suppliers. But this tactic would take some backbone from the IT and purchasing departments.

"If sales revenue depends on showing the numbers, it will happen," he said. "The suppliers will ask for nondisclosure agreements. But if the large customer refuses to sign and says they will buy from another supplier who will reveal the numbers, supplier revelation is inevitable."
