Thursday, January 29, 2009

Data Center Failure Stories

First, this story offers some insight into datacenter failure models. I found it interesting that these were not, for the most part, faults due to ignorance. They were faults that occurred despite a lot of engineering effort, testing, and close supervision of the production systems. Bus duct, which I had not heard of before reading this article, seems to be designed specifically for the industrial purpose it was serving, so finding somebody to blame in the failure circumstances described would be difficult.

I think the lesson to be learned is that unpredicted failures are going to occur despite the best engineering efforts, as the second assigned reading confirms:

"81 percent of respondents had experienced a failure in the past five years, and 20 percent had been hit with at least five failures."


One area of emerging research that we are addressing as part of the RAD Lab project is the use of multiple data centers. In the 90's we changed the world with the NOW (Network of Workstations) project. Maybe one of the current decade's memorable Berkeley systems research projects will be NOD (Network of Datacenters).

This reminds me of a topic we recently explored in discussions about cloud computing here in the RAD Lab: reputation fate sharing. In the horror stories presented in this article, the colocation facility and the government agency running the datacenters that failed were reputation fate sharing with the construction companies they contracted with and the equipment manufacturers they purchased from.

The third article on datacenter failure was actually documentation for a Sun directory service backup strategy. As a document aimed at the administrators of a real system, it is highly pragmatic, with advice about inherent trade-offs such as deciding at which level of the architecture, and in which format, to back up data (in their case, binary replication at the filesystem level vs. other formats).
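
As a rough illustration of that trade-off (only a sketch; the paths and the db2ldif invocation below are assumptions on my part, not the documented procedure), a binary backup copies the database files directly, while a logical export produces a portable file:

    import shutil
    import subprocess

    DB_DIR = "/var/ds/db"                    # assumed location of the directory database files
    BINARY_BACKUP_DIR = "/backup/ds-binary"  # assumed destination for the binary copy
    LDIF_EXPORT_FILE = "/backup/ds-export.ldif"

    def binary_backup():
        # Filesystem-level copy: fast, but generally restorable only onto a
        # server of the same version and architecture.
        shutil.copytree(DB_DIR, BINARY_BACKUP_DIR, dirs_exist_ok=True)

    def logical_backup():
        # Export to LDIF: slower and larger, but the output is portable across
        # server versions. The exact command and flags vary by directory
        # server version; this particular invocation is an assumption.
        subprocess.run(["db2ldif", "-n", "userRoot", "-a", LDIF_EXPORT_FILE], check=True)

The binary copy optimizes for restore speed on identical hardware; the logical export optimizes for portability, which is exactly the kind of trade-off the documentation asks administrators to weigh up front.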

The issue of which level to build replication and fault tolerance into a system architecture also came up during the discussion I referenced above about the use of multiple datacenters for cloud application fault tolerance. Somebody pointed out that one data store technology did intelligent replication at the application level while another punted on the issue, delegating fault tolerance to lower levels of the architecture. The classic end-to-end systems principle comes to mind here, and I believe there is a direct trade-off between the latency of inter-datacenter replication (and thus consistency, in the data store context) and the amount of information you can give the system. For example, if HDFS were to provide cross-datacenter replication, in the absence of any hints it would simply replicate all files at the block level (using compression, of course), while perhaps only a small subset of those blocks would be sufficient if the application were willing to participate in the replication process.
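
To make the hint idea concrete, here is a minimal sketch in Python (the function names, paths, and "critical files" policy are all hypothetical; none of this is part of HDFS) contrasting a hint-free policy that copies everything to a remote datacenter with a hinted policy that copies only what the application flags:

    import gzip
    import shutil
    from pathlib import Path

    def replicate_everything(src_root, remote_root):
        # Hint-free policy: compress and copy every file, since the
        # replication layer has no idea which data actually matters.
        for path in Path(src_root).rglob("*"):
            if path.is_file():
                dest = Path(remote_root) / path.relative_to(src_root)
                dest.parent.mkdir(parents=True, exist_ok=True)
                with open(path, "rb") as fin, gzip.open(f"{dest}.gz", "wb") as fout:
                    shutil.copyfileobj(fin, fout)

    def replicate_with_hints(src_root, remote_root, critical_paths):
        # Hinted policy: the application names the files it cannot afford
        # to lose, so only that subset crosses the inter-datacenter link.
        for rel in critical_paths:
            src = Path(src_root) / rel
            dest = Path(remote_root) / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            with open(src, "rb") as fin, gzip.open(f"{dest}.gz", "wb") as fout:
                shutil.copyfileobj(fin, fout)

    # Example: a data store that knows its write-ahead log is the only thing
    # it needs for cross-datacenter recovery.
    # replicate_with_hints("/data/store", "/mnt/remote-dc", ["wal/current.log"])

The hinted version trades generality for latency: replication of the critical subset can be acknowledged much sooner than a full copy, which is where the end-to-end argument for pushing the decision up to the application comes from.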

This third article also presents elements of the sort of preparedness strategy that the first two articles so strongly call for, complete with a "for dummies" decision flow chart.

1 comment:

Matei Zaharia said...

I like the name NOD! We have to write at least a HotOS paper about it.