Thursday, January 29, 2009

Failure Trends in a Large Disk Drive Population

Summary of the paper:
------------------
This seems like something we could (and probably should) reproduce in Chukwa. Also, we definitely should have cited this paper in the Chukwa CCA paper.

They collect the following metrics about disks into Bigtable (which is on top of GFS), and then analyze them using Sawzall:
-environmental factors (such as temperatures)
-activity levels
-the Self-Monitoring Analysis and Reporting Technology (SMART) parameters

They brag about the size of their dataset. Concise statement of interesting findings near the beginning of the paper:
-little correlation between failure and temperature or activity levels
-strong correlation between failure and some SMART parameters
-many failures happen without SMART pre-indicators, so SMART data alone is not sufficient for predicting HDD failures.

This makes me wonder if the hard drives in the R cluster nodes are SMART disks. Can we collect this using a Chukwa adaptor?

Their pool was over 100,000 disks. They point out that the definition of "failure" is highly variable, and thus many disks sent back to manufacturers are deemed operational by the manufacturer's tests.

Interesting point about filtering for "spurious" data, which they define simply as data that has an impossible value (e.g. negative, 100000 degrees F).
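
A rough sketch of what that kind of filter might look like. The bounds, the Reading record, and the sample values are all made up for illustration; the paper does not publish its thresholds or its data format.

```python
from typing import List, NamedTuple


class Reading(NamedTuple):
    """One temperature sample from one drive (hypothetical record shape)."""
    drive_id: str
    temp_c: float  # reported temperature in degrees Celsius


# Assumed plausibility bounds, chosen only for this example.
MIN_TEMP_C = 0.0
MAX_TEMP_C = 100.0


def is_plausible(reading: Reading) -> bool:
    """True if the value is physically possible under the assumed bounds.

    NaN fails both comparisons, so it is treated as spurious too.
    """
    return MIN_TEMP_C <= reading.temp_c <= MAX_TEMP_C


def drop_spurious(readings: List[Reading]) -> List[Reading]:
    """Keep only the readings whose values pass the plausibility check."""
    return [r for r in readings if is_plausible(r)]


sample = [
    Reading("d1", 35.0),
    Reading("d2", -40.0),     # negative: impossible for a running drive here
    Reading("d3", 55000.0),   # absurdly hot
    Reading("d4", 41.5),
]
clean = drop_spurious(sample)
print([r.drive_id for r in clean])
```

The point is that the check only rejects impossible values; it makes no attempt to detect values that are plausible but wrong.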

Failure and Utilization:
Bucketing their disks into high, medium, and low overall utilization, they observed higher mortality rates for heavily utilized disks only during infancy and at the 5-year mark.

SMART data:
In a word, useful. They found that a variety of SMART parameters helped predict disk failures, including scan errors, reallocation counts, offline reallocations, probational counts, and many others. E.g. "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."
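
A statement like "N times more likely to fail" is a ratio of two failure rates. Here is a minimal sketch of that arithmetic. The function names and the counts are invented for illustration and are not the paper's numbers.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of drives in a group that failed within the window."""
    if total <= 0:
        raise ValueError("group must contain at least one drive")
    return failed / total


def relative_risk(failed_with: int, total_with: int,
                  failed_without: int, total_without: int) -> float:
    """Failure rate of drives with the signal divided by the rate without it."""
    baseline = failure_rate(failed_without, total_without)
    if baseline == 0:
        raise ValueError("baseline failure rate is zero; ratio is undefined")
    return failure_rate(failed_with, total_with) / baseline


# Made-up counts: 1,000 drives with a scan error, 10,000 without,
# both observed over the same window after the first error.
rr = relative_risk(failed_with=30, total_with=1000,
                   failed_without=10, total_without=10000)
print(f"relative risk: {rr:.1f}x")
```

With these made-up counts the two rates are 3% and 0.1%, so the ratio is about 30. Note that a large ratio can still sit on top of a small absolute failure rate.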

They also, almost unintentionally, provide a subtle commentary on the nature of standardized device protocols such as SMART. Some SMART parameter definitions varied by manufacturer, and some parameters, such as Spin Retries, were never observed. In such a large population, that implies these parameters either measure something unimportant (i.e. something that trivially does not occur) or are not actually implemented by the devices.

They tried to predict failures based on SMART data, and found that they could never predict more than half of the failed drives. With more features, or perhaps more complex data (which maybe they have but just didn't feature in the paper), this sounds like a place where someone in the RAD Lab could fruitfully apply machine learning.
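
To make the "never more than half" point concrete, here is a toy sketch of measuring the recall of a simple rule that flags a drive if any SMART signal is non-zero. The records, field names, and rule are hypothetical; this is not the paper's model or data.

```python
from typing import Dict, List, Union

Drive = Dict[str, Union[int, bool]]

# Hypothetical per-drive records: SMART counts plus whether the drive later failed.
drives: List[Drive] = [
    {"scan_errors": 2, "reallocations": 0, "failed": True},
    {"scan_errors": 0, "reallocations": 0, "failed": True},   # failed silently
    {"scan_errors": 0, "reallocations": 5, "failed": True},
    {"scan_errors": 0, "reallocations": 0, "failed": True},   # failed silently
    {"scan_errors": 0, "reallocations": 1, "failed": False},
    {"scan_errors": 0, "reallocations": 0, "failed": False},
]

SIGNALS = ("scan_errors", "reallocations")


def predicts_failure(drive: Drive) -> bool:
    """Flag a drive if any of the chosen SMART counts is non-zero."""
    return any(drive[name] > 0 for name in SIGNALS)


def recall(population: List[Drive]) -> float:
    """Fraction of drives that actually failed which the rule flagged."""
    failed = [d for d in population if d["failed"]]
    if not failed:
        raise ValueError("no failed drives in the population")
    caught = sum(1 for d in failed if predicts_failure(d))
    return caught / len(failed)


print(f"recall: {recall(drives):.2f}")
```

Drives that fail with every signal at zero cap the recall of any rule built on those signals alone, no matter how the thresholds are tuned. That is why extra features would be needed before machine learning could do better.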

Overall comments:
---------------------------
They do a great job pointing out and presenting satisfying explanations for the unexpected findings regarding disk failure trends and correlations.

Criticisms:
---------------
I don't know that the subsubsection in 3.5.5 on vibration contributed to the paper much.
