Data sheets for hard drives have always included a specification for reliability expressed in hours: commonly known as MTBF (mean time between failures), or sometimes MTTF (mean time to failure). Same difference: one assumes that a failed drive will be repaired, the other that it will be replaced. Nowadays, this number is around a million hours for an “enterprise” hard drive. Some drives are rated at 1.5 million hours.
Now, that's a good stretch of time. After all, a year is only 8,760 hours, so one million hours comes to a bit more than 114 years. Some may be scratching their heads, since the hard drive itself has been around for only 50 years (starting with IBM's giant 350 Disk Storage Unit for its RAMAC computer). Obviously, no drive has been running that long.
Instead, MTBF is a statistical measure, extrapolated from far shorter tests run on large batches of drives. The upshot is that drives are very reliable, with a rated failure rate well under 1 percent per year. Go Team Storage!
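To see how a million-hour MTBF turns into a sub-1-percent annual figure, here is a quick back-of-the-envelope conversion (a sketch; the constant-failure-rate assumption is the spec sheet's, and the fleet size is invented for illustration):

```python
HOURS_PER_YEAR = 8_760

def annualized_failure_rate(mtbf_hours: float) -> float:
    """Approximate fraction of drives expected to fail per year,
    assuming a constant failure rate of 1/MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

afr = annualized_failure_rate(1_000_000)
print(f"AFR: {afr:.2%}")                                 # ~0.88% per year
print(f"In a 1,000-drive fleet: ~{1000 * afr:.0f} failures/year")
```

That constant-rate assumption, of course, is exactly what the field data calls into question.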
However, several papers on large-scale storage presented at FAST '07, the USENIX Conference on File and Storage Technologies, held recently in San Jose, Calif., are kicking up a stir online about MTBF.
The Best Paper award was handed to “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” by Bianca Schroeder and Garth Gibson of Carnegie Mellon University in Pittsburgh.
Their study tracked roughly 100,000 drives used at large-scale storage sites, including high-performance computing centers and Web servers. The data suggests that much of the common wisdom surrounding disk reliability is wrong.
For example, they found that annual disk replacement rates were more in the range of 2 to 4 percent, and ran as high as 13 percent at some sites. Yikes.
In addition, the results contradicted the widespread IT belief in burn-in: the idea that most problems with any drive (or electronic device, really) will show up at the very beginning of its life cycle (Schroeder and Gibson called this the “infant mortality effect”). Instead, the study showed failure rates rising steadily from the first years of service, rather than holding low until wear-out kicked in after five years or so, as expected.
At the same time, the researchers said they found “little difference in replacement rates between SCSI, FC [Fibre Channel] and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.”
This finding will certainly bring outcry from the marketing departments at storage vendors and drive manufacturers. Drives aimed at the enterprise are supposed to be better through-and-through than commodity SATA mechanisms, running faster platters but also using more-robust components—and justifying the higher price tag. If this isn't necessarily so, then the storage vendors will have some explaining to do.
Meanwhile, “Failure Trends in a Large Disk Drive Population,” a report by a team of Google engineers presented at FAST, tackles the question of whether the SMART (self-monitoring, analysis and reporting technology) routines in drives and storage utility software are much good at predicting failure.
SMART's failure analysis covers a range of mechanical and electrical conditions and is supposed to function like your car's oil warning system: the oil level is monitored over time, and when some threshold is crossed, the system warns you and, hopefully, prevents a catastrophic failure of the engine.
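The threshold mechanism itself is simple: each monitored attribute carries a vendor-normalized value that decays toward a failure threshold, and crossing it trips the warning. A minimal sketch of that logic (the attribute readings below are invented for illustration, not taken from any real drive):

```python
# Minimal sketch of SMART-style threshold monitoring. Normalized
# attribute values start high (e.g., 100) and decay toward a
# vendor-set threshold; reaching it trips the warning.

def smart_warning(normalized_value: int, threshold: int) -> bool:
    """True when an attribute has degraded to or past its threshold."""
    return normalized_value <= threshold

# Hypothetical readings: (attribute name, normalized value, threshold)
attributes = [
    ("Reallocated_Sector_Ct", 95, 36),   # still healthy
    ("Spin_Retry_Count",      30, 51),   # degraded past threshold
]

for name, value, thresh in attributes:
    if smart_warning(value, thresh):
        print(f"WARNING: {name} at {value} (threshold {thresh})")
```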
However, Google's Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso reported that high temperature and heavy utilization aren't necessarily the predictors of failure that SMART assumes.
Yet once SMART found that a drive was having scan and reallocation errors, that drive was 39 times more likely to fail within the next two months than a drive reporting no such errors. So “first errors” are a good sign of impending failure.
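That 39x figure is a relative risk: the failure rate among drives that logged a first scan error, divided by the rate among clean drives. A toy calculation with made-up counts shows the shape of it (only the ratio mirrors the paper's finding; every count below is invented):

```python
# Toy relative-risk calculation. All counts are hypothetical;
# only the resulting ratio echoes the Google paper's 39x result.

drives_with_errors = 250      # logged a scan/reallocation error
failed_with_errors = 50       # of those, failed within two months

drives_clean = 9_750          # no SMART error signals
failed_clean = 50             # of those, failed within two months

rate_errors = failed_with_errors / drives_with_errors   # 20.0%
rate_clean = failed_clean / drives_clean                # ~0.5%
relative_risk = rate_errors / rate_clean

print(f"Relative risk: {relative_risk:.0f}x")   # 39x
```

The catch runs in the other direction: many failed drives never log any error at all, so a clean SMART record proves little, as the next finding shows.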
Still, SMART didn't really make the grade, the team suggested.
“Despite those strong correlations [the scan and reallocation errors reported by SMART], we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.”
A number of storage bloggers are pushing the industry for some answers. For example, an open letter by Robin Harris on StorageMojo asks drive vendors to come clean on MTBF and suggests the problem also calls into question reliability claims for some RAID systems.
“I believe many readers of these papers will conclude that uncomfortable facts were either ignored or misrepresented by companies that knew better or should have known better. For example, in all the discussion of RAID-DP I've seen, the argument is couched in terms of unrecoverable read error rates, not, for example, the likelihood of two drives failing in an array is greater than assumed. Given that field MTBF rates seem to be several times higher than vendors say, I'm now wondering about claimed bit error rates,” he said.
There is some smoke here, but it's a complicated subject. Each site, each server and each installation of a hard disk is different, and all of them differ from the controlled testing a drive manufacturer performs and the statistical extrapolation to MTBF that results.
“People want to condense [MTBF] to a single sound bite. They dumb it down and lose the essence of it,” said Ed Tierney, director of marketing for storage vendor ATTO Technology, of Amherst, N.Y. The company examines the results of backroom testing as well as the rates from products in the field.
While hard disks are a very different product from the HBAs (host bus adapters) ATTO makes, the company has examined a large sample of drives. Tierney said the statistical failure rates came close to the rates seen in the field.
According to storage industry analyst Jim Porter of Mountain View, Calif.-based Disk/Trend, there isn't any reliable way to statistically review the reliability of disk drives, as used in the field.
“There are just too many different kinds of usage sites, too many variations in management skills, and a variety of disk drive types,” he said.
What seems clear is that there's a gap between the reliability expectations of manufacturers and those of customers. The current MTBF model isn't accurately accounting for how drives are handled in the field and how they function inside systems.
Problems with handling drives can come anywhere along the supply chain.
I spoke with an analyst a number of years ago who was standing on the docks in Malaysia while visiting the fab operation of a major disk manufacturer. He watched the shipping containers filled with drives being loaded into the boat.
Suddenly, he said, a chain broke and the container fell many stories onto the concrete pier. The chain was refastened, and the container was hoisted up again. Those drives found their way into servers and after-market systems.
And we wonder why some series of hard disks get a reputation for problems? Maybe some of those drives were treated to a G-shock test that never made the record books.
But I see a trend toward cavalier handling of drives in the field, especially 3.5-inch mechanisms. Perhaps people have grown used to handling 2.5-inch notebook drives, which are designed to take a bit more of a beating than their larger cousins.
Porter said that it's hard to draw conclusions about MTBF even from a large sample of drives, since they are basically all individual experiences. Remember, he said, that the industry shipped more than 400 million hard disk drives in 2006.
“Some disk drives will fail, because it's expected within the reliability specs we're discussing. That's why RAID versions of storage systems were developed, because the failure of two drives within the same storage system at the same time is extremely rare,” he said.
Yet the distance between “rare” and “impossible” seems to have been bridged in the minds of customers. Inflated or not, when time-to-failure figures are quoted in decades, the worry gets pushed out of mind and into another year's budget.
It's easy to think that a failed drive will always be found in someone else's server with someone else's data.
No doubt the inflated MTBF stats on the spec sheets have helped that misunderstanding along. Even if MTBF were a reliable predictor for some perfect hard drive on the testing bench, drives in the field will fail and fail regularly.
Worse, there's now the expectation that all data will live forever, that no data will ever be lost. Come on: that isn't reality.
Here's a slice of this reality disjunction. Vendors tell us that a RAID 6 array can have a “mean time before data loss” of some 86,695 years. Yet at several conferences I've attended in the past year, someone predicted that somewhere, soon, a RAID Level 6 array will fail. That's with double the redundancy.
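Figures like 86,695 years typically come from MTTDL (mean time to data loss) models. A textbook version for a double-parity array, assuming independent failures and a fixed repair time, can be sketched as follows (the array size and repair time below are assumptions for illustration, and vendor models fold in unrecoverable read errors as well, so the absolute numbers here are not anyone's published claim):

```python
def mttdl_raid6_hours(mtbf: float, n_drives: int, mttr: float) -> float:
    """Textbook MTTDL for a double-parity (RAID 6) array: data is lost
    only when a third drive fails while two rebuilds are outstanding.
    Assumes independent, constant-rate failures -- the very assumption
    the field data undermines."""
    return mtbf**3 / (n_drives * (n_drives - 1) * (n_drives - 2) * mttr**2)

HOURS_PER_YEAR = 8_760
# Hypothetical 8-drive array with a 24-hour rebuild window.
rated = mttdl_raid6_hours(mtbf=1_000_000, n_drives=8, mttr=24) / HOURS_PER_YEAR
field = mttdl_raid6_hours(mtbf=1_000_000 / 3, n_drives=8, mttr=24) / HOURS_PER_YEAR

print(f"At rated MTBF:        {rated:,.0f} years to data loss")
print(f"At 3x the field rate: {field:,.0f} years")
```

Because MTTDL scales with the cube of per-drive MTBF in this model, a field failure rate three times the rated one cuts the projection by a factor of 27, and correlated failures in a shared enclosure shrink it further. That's how “rare” drifts toward “routine.”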
Certainly, the lesson from the FAST research is that IT budgets must include a line item for regular replacement of hard disks, even if the MTBF says it isn't necessary. That may cut into spending on new storage systems, the line item CIOs and storage vendors prefer.
I recall a bit of discussion about MTBF at a meeting of the San Francisco SNUG (storage networking user group) last summer. One reseller in the group asked a storage vendor to “quit publishing this crap.”
That recommendation will be a tough one for the marketing department to execute. In the meantime, we can all start by taking a more realistic attitude toward MTBF.
What do you think? Is MTBF an outrage? Or did you see through it all along? Let us know.