Hard Disk MTBF: Flap or Farce?

Opinion: Research into the real-world reliability of hard disks in the data center now points to inflated MTBF (mean time between failures) numbers on vendors' spec sheets. But really, why did anyone believe them to begin with?

Data sheets for hard drives have always included a specification for reliability expressed in hours, commonly known as MTBF (mean time between failures) or sometimes MTTF (mean time to failure). Same difference: One assumes that a failed drive will be fixed, and the other that it will be replaced. Nowadays, this number is around a million hours for an "enterprise" hard drive. Some drives are rated at 1.5 million hours.

Now, that's a good stretch of time. After all, a year is only 8,760 hours. One million hours comes to a bit more than 114 years. Some may be scratching their heads, since the hard drive itself has only been around for 50 years (dating from IBM's giant 350 Disk Storage Unit for its RAMAC computer). This can be confusing.

The number is not a promise that any one drive will run that long. Instead, the MTBF is a statistical measure, calculated by extrapolating from much shorter test runs. What it is supposed to mean is that drives are very reliable, with a failure rate well under 1 percent per year. Go Team Storage!
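The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: it assumes a constant failure rate over the drive's life (the simple exponential model), which is my assumption and not necessarily how any vendor derives its rating. The function names are made up for the example.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours, the figure used in the article


def mtbf_to_years(mtbf_hours):
    """Express an MTBF rating in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours):
    """Chance that a drive fails within one year of 24x7 use.

    Assumption: failures arrive at a constant rate (exponential model).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# The two data-sheet ratings mentioned in the article.
for mtbf in (1_000_000, 1_500_000):
    years = mtbf_to_years(mtbf)
    afr = annualized_failure_rate(mtbf)
    print(f"{mtbf:>9,} hours = {years:.1f} years, annual failure rate ~ {afr:.2%}")
```

Under that assumption, a million-hour rating works out to roughly 114 years and an annual failure rate a bit under 0.9 percent, which is where the "well under 1 percent" claim comes from.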

However, several papers on large-scale storage presented at FAST '07, the USENIX Conference on File and Storage Technologies, held recently in San Jose, Calif., are kicking up a stir online about MTBF.

The Best Paper award was handed to "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" by Bianca Schroeder and Garth Gibson of Carnegie Mellon University in Pittsburgh.

Their study tracked a whopping set of drives used at large-scale storage sites, including high-performance computing and Web server installations. The data suggests that several pieces of conventional wisdom about disk reliability are wrong.

For example, they found that annual disk replacement rates were more in the range of 2 to 4 percent and were as high as 13 percent at some sites. Yikes.
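To see how far those field numbers sit from the spec sheet, the calculation can be run backward: given an annual replacement rate, what MTBF would it imply? The sketch below is a rough illustration only. It assumes a constant failure rate and treats every replacement as a failure, which the study itself does not claim, and `implied_mtbf_hours` is a name invented for the example.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def implied_mtbf_hours(annual_rate):
    """MTBF that would produce the given annual replacement rate.

    Assumptions: constant failure rate (exponential model), and every
    replacement counted as a failure.
    """
    return -HOURS_PER_YEAR / math.log(1.0 - annual_rate)


DATASHEET_MTBF = 1_000_000  # typical "enterprise" rating cited in the article

# Replacement rates reported in the article: 2%, 4% and 13% per year.
for rate in (0.02, 0.04, 0.13):
    mtbf = implied_mtbf_hours(rate)
    ratio = DATASHEET_MTBF / mtbf
    print(f"{rate:.0%} per year -> about {mtbf:,.0f} hours "
          f"(data sheet is {ratio:.1f}x higher)")
```

On those assumptions, a 2 percent replacement rate corresponds to something in the neighborhood of 430,000 hours, and 13 percent to roughly 63,000 hours, a long way from a million.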

In addition, the results ran contrary to the widespread IT belief in burn-in, the idea that most problems with a drive (or any electronic device, really) show up at the very beginning of its life cycle (Schroeder and Gibson called this the "infant mortality effect"). Instead, the study showed that failures start climbing in the first few years of service and keep growing, rather than holding off until wear-out sets in after five years or so, as had been expected.

At the same time, the researchers said they found "little difference in replacement rates between SCSI, FC [Fibre Channel] and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors."

This finding will certainly bring an outcry from the marketing departments at storage vendors and drive manufacturers. Drives aimed at the enterprise are supposed to be better through and through than commodity SATA mechanisms, with faster-spinning platters and more-robust components that justify the higher price tag. If this isn't necessarily so, then the storage vendors will have some explaining to do.

Meanwhile, "Failure Trends in a Large Disk Drive Population," a paper by a team of Google engineers also presented at FAST, tackles the question of whether the SMART (self-monitoring, analysis and reporting technology) routines in drives and storage utility software are much good at predicting failure.

SMART's failure analysis covers a range of mechanical and electrical conditions and is supposed to function like your car's oil warning system. Your car's oil level is monitored over time, and when some threshold is reached, the system tells you and, with luck, prevents a catastrophic failure of the engine.

However, Google's Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso reported that high temperature and heavy use aren't necessarily the predictors of failure that SMART assumes.

On the other hand, once SMART found that a drive was having scan and reallocation errors, that drive was 39 times more likely to fail over a two-month span than a drive that reported no such errors. So, "first errors" are a good sign of impending failure.

Still, SMART didn't really make the grade, the team suggested.

"Despite those strong correlations [the scan and reallocation errors reported by SMART], we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever."

A number of storage bloggers are pushing the industry for some answers. For example, an open letter by Robin Harris on StorageMojo asks drive vendors to come clean on MTBF and suggests the problem also calls into question the reliability claims for some RAID systems.

"I believe many readers of these papers will conclude that uncomfortable facts were either ignored or misrepresented by companies that knew better or should have known better. For example, in all the discussion of RAID-DP I've seen, the argument is couched in terms of unrecoverable read error rates, not, for example, the likelihood of two drives failing in an array is greater than assumed. Given that field MTBF rates seems to be several times higher than vendors say, I'm now wondering about claimed bit error rates," he said.

There is some smoke here, but it's a complicated subject. Each site, each server and each installation for a hard disk can be different. And all of them will, no doubt, differ greatly from the test procedure a storage manufacturer runs, as well as from the statistical extrapolation to MTBF that results.
