Thursday, January 29, 2009

Failure Trends in a Large Disk Drive Population

Summary of the paper:
This seems like something we could (and probably should) reproduce in Chukwa. Also, we definitely should have cited this paper in the Chukwa CCA paper.

They collect the following metrics about disks into Bigtable (which is on top of GFS), and then analyze them using Sawzall:
-environmental factors (such as temperatures)
-activity levels
-the Self-Monitoring Analysis and Reporting Technology (SMART) parameters

They brag about the size of their dataset. Concise statement of interesting findings near the beginning of the paper:
-little correlation between failure and temperature or activity levels
-strong correlation between failure and some SMART parameters
-many failures happen without SMART pre-indicators, so SMART data alone is not sufficient for predicting HDD failures.

This makes me wonder if the hard drives in the R cluster nodes are SMART disks. Can we collect this using a Chukwa adaptor?
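If they are, one plausible route is a Chukwa adaptor that periodically shells out to smartmontools' smartctl and parses the attribute table. Here's a sketch of the parsing half; the sample output is abbreviated from a typical `smartctl -A` run (the attribute names are real SMART attributes, but the parsing code and its structure are my own guess at what an adaptor would need):

```python
import re

def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` output into a dict
    of {attribute_name: raw_value}, with raw values as ints where possible."""
    attrs = {}
    for line in text.splitlines():
        # Attribute rows start with a numeric ID, then name, flag, value,
        # worst, thresh, type, updated, when_failed, and raw value columns.
        m = re.match(
            r"\s*\d+\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
            r"\s+\S+\s+\S+\s+\S+\s+(\S+)", line)
        if m:
            name, raw = m.groups()
            attrs[name] = int(raw) if raw.isdigit() else raw
    return attrs

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   036   053   000    Old_age   Always       -       36
"""

print(parse_smart_attributes(SAMPLE))
```

An adaptor would run this against each disk on an interval and emit the dict as a Chukwa record; temperature and reallocation counts are exactly the fields the paper correlates with failure.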

Their pool was over 100,000 disks. They point out that the definition of "failure" is highly variable, and thus many disks sent back to manufacturers are deemed operational by the manufacturers' tests.

Interesting point about filtering for "spurious" data, which they define simply as data that has an impossible value (e.g. a negative count, or 100,000 degrees F).

Failure and Utilization:
Bucketing their disks into high, medium, and low overall utilization, they observed higher mortality rates during infancy and at the 5-year mark.

SMART data:
In a word, useful. They found a variety of SMART parameters were useful for predicting disk failures. This includes scan errors, relocation errors, offline relocation errors, probational counts, and many others. E.g. "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."

They also, almost unintentionally, provide a subtle commentary on the nature of standardized device protocols such as SMART. Some of the SMART parameter definitions varied by manufacturer, and some parameters, such as Spin Retries, were never observed at all. In such a large population, that implies either that they measure something unimportant (i.e. something that trivially does not occur) or that the devices do not actually implement them.

They tried to predict failures based on SMART data, and found that they could never predict more than half of the failed drives. Perhaps with more features, or more complex data (which maybe they have but just didn't feature in the paper), they could do better. It sounds like a place where someone in the RAD Lab would think to fruitfully apply machine learning.
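As a toy illustration of why SMART-only prediction caps out, here is a hypothetical rule-based predictor (the drive records and field names are invented for illustration): any rule keyed off SMART pre-indicators necessarily misses the failures that arrive without them, which is exactly the population the paper says exists.

```python
# Flag a drive as at-risk if any of the signals the paper found
# predictive is nonzero. Field names are made up for this sketch.
PREDICTIVE = ("scan_errors", "reallocation_count",
              "offline_reallocations", "probational_count")

def at_risk(drive):
    return any(drive.get(field, 0) > 0 for field in PREDICTIVE)

drives = [
    {"scan_errors": 1, "failed": True},         # caught: scan error preceded failure
    {"reallocation_count": 3, "failed": True},  # caught
    {"failed": True},                           # missed: no SMART pre-indicator
    {"failed": False},                          # correctly left alone
]

predicted = [at_risk(d) for d in drives]
print(predicted)  # → [True, True, False, False]
```

The third drive fails with no pre-indicator, so no threshold tuning on these fields can recover it; better recall would require features beyond SMART.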

Overall comments:
They do a great job pointing out and presenting satisfying explanations for the unexpected findings regarding disk failure trends and correlations.

I don't know that the subsubsection in 3.5.5 on vibration contributed to the paper much.

Data Center Failure Stories

First, this story offers some insights into datacenter failure models. I found it interesting that these were not, for the most part, faults due to ignorance. They were faults that occurred despite a lot of engineering effort, testing, and close supervision of the production systems. Bus duct, which I had not heard of before reading this article, seems to be specifically designed for the industrial purpose for which it was being used, so finding somebody to blame would be difficult in the circumstances described.

I think the lesson to be learned is that unpredicted failures are going to occur despite the best engineering efforts. The second assigned reading confirms this:

"81 percent of respondents had experienced a failure in the past five years, and 20 percent had been hit with at least five failures."

One area of emerging research that we are addressing as part of the RAD Lab project is the use of multiple data centers. In the 90's we changed the world with the NOW (Network of Workstations) project. Maybe one of the current decade's memorable Berkeley systems research projects will be NOD (Network of Datacenters).

This reminds me of a topic which we recently explored in discussions about cloud computing here in the RAD Lab: reputation fate sharing. In the horror stories presented in this article, the colocation facility and the government agency running the failed datacenters were reputation fate sharing with the construction companies they contracted with and the equipment manufacturers they purchased from.

The third article on datacenter failure was actually documentation for a Sun directory service backup strategy. As a document aimed at the administrators of a real system, it is highly pragmatic, with advice about inherent trade-offs such as deciding at which level of the architecture, and in which format, to back up data (in their case, binary replication at the filesystem level vs. other formats).

The issue of which level to build replication and fault tolerance into a system architecture also came up during the discussion I reference above about the use of multiple datacenters for cloud application fault tolerance. Somebody pointed out that one data store technology did intelligent replication at the application level while another punted on the issue, delegating fault tolerance to lower levels of the architecture. The classic end-to-end systems principle comes to mind here, and I believe that there is a direct trade-off between the latency of inter-datacenter replication (and thus consistency in the data store context) and the amount of information you can give the system. For example, if HDFS were to provide cross datacenter replication, in the absence of any hints it would simply replicate all files at the block level (using compression of course), while perhaps only a small subset of those blocks would be sufficient if the application were willing to participate in the replication process.
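A sketch of what the hint-driven end of that trade-off might look like (the function and hint names here are entirely hypothetical, not anything HDFS provides):

```python
def blocks_to_replicate(blocks, hints=None):
    """Pick which blocks to ship to a remote datacenter. With no
    application hints, fall back to replicating everything at the
    block level; with hints, replicate only the blocks the
    application marked as critical."""
    if hints is None:
        return list(blocks)
    return [b for b in blocks if hints.get(b, False)]

blocks = ["b1", "b2", "b3"]
print(blocks_to_replicate(blocks))                # no hints: all three blocks
print(blocks_to_replicate(blocks, {"b2": True}))  # hinted: just ["b2"]
```

The end-to-end point is visible even in this trivial form: only the application knows which blocks matter, so pushing that knowledge down shrinks the replication (and consistency) cost, at the price of making the application participate.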

This third article also presents elements of the sort of preparedness strategies that the first two articles so strongly call for, complete with a "for dummies" decision flow chart.

Saturday, January 24, 2009

Introduction to Network Computing

This website is about as useless as network computing was. Though I don't blame Ion for not being able to find anything good about the failed endeavor.

One thing I did learn was that Apple made a game console/network computer called Pippin, which apparently, like all things NC, failed miserably.

An experimental time-sharing system - Fernando Corbato

First of all, this paper is old: 1962.

Memory challenges were identified: things about isolation

Programming problems were identified: accounting, supervisor must manage shared IO resources, need good programming tools

Other problems: already thinking about failure
Their design: basically how we would build this today, probably because this was an ancestor of our current time-sharing systems.

some of the most challenging problems:
-HCI, which required considerably more experimentation and evaluation
-multiple terminals communicating with one program
-get some disks (only had tapes)

haha, operators take commands from "the supervisor", i.e. root, e.g. change this tape out

They use a multilevel priority queue scheduling policy, assigning the initial priority of a process based on the number of words in the program (equation (2)). This seems like a very crude first stab, essentially rating based on how fast the process can be moved into and out of memory from tape.

They run a job from queue level l for 2^l quanta of time, and if it doesn't finish they move it up to queue level l+1. They run everything at queue level l (i.e. until level l is empty), then everything at level l+1 until it is empty, etc. If a job enters a queue at a level lower than the currently operating level, they jump down to that queue level and start the process over from there.
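The multilevel scheme above can be sketched in Python. This is a simplified model with my own names: all jobs start at level 0 here, whereas the real system picks the initial level from program size, and "service time" stands in for actual CPU work:

```python
from collections import deque

def run_mlfq(jobs, levels=9):
    """Simplified model of the multilevel queue: a job at level l runs
    for 2^l quanta; if unfinished, it moves up to level l+1. After each
    slice we rescan from the lowest (highest-priority) nonempty level,
    which models jumping down when work appears at a lower level."""
    queues = [deque() for _ in range(levels)]
    for name, need in jobs:             # need = total quanta of service required
        queues[0].append([name, need])
    finished = []
    level = 0
    while level < levels:
        if not queues[level]:
            level += 1                  # this level is empty; move up
            continue
        job = queues[level].popleft()
        job[1] -= 2 ** level            # run for 2^l quanta
        if job[1] <= 0:
            finished.append(job[0])
        else:
            queues[min(level + 1, levels - 1)].append(job)
        level = 0                       # rescan from the top after every slice
    return finished

print(run_mlfq([("a", 1), ("b", 3), ("c", 7)]))  # → ['a', 'b', 'c']
```

Short jobs finish in the cheap low levels, while long jobs sink to levels where each slice is long enough to amortize the tape swap, which is the whole point of the exponential quanta.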

They keep the efficiency of the system > 1/2 by enforcing the policy that the time quantum must be larger than the context-switch overhead time (equation (3)).
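That bound is easy to see: if each quantum q of useful work incurs a swap/switch overhead s, efficiency is q/(q+s), so requiring q >= s guarantees efficiency >= 1/2 (my notation, not necessarily the paper's):

```python
def efficiency(q, s):
    # Fraction of time spent on useful work when each quantum q
    # of computation costs an additional overhead s to swap/switch.
    return q / (q + s)

print(efficiency(1.0, 1.0))  # q == s → exactly 0.5
print(efficiency(4.0, 1.0))  # q > s  → 0.8, comfortably above 1/2
```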

They also give some numbers for what the parameters would be on the IBM 7090.