Tuesday, November 25, 2008

Hadoop

Sweet! A crap ton of feedback on our work and ideas for future work all for free! What a good idea to have the class read this. I agree entirely with Yanpei's observations about our work, that we should be looking for the general problems in MapReduce, and even more generally in distributed systems.

In the spirit of Matei's self-criticism, you might find our summaries of the feedback we got from the OSDI reviewers interesting.

I would argue, however, that by looking at what Hadoop got wrong, we learn more than "Hadoop developers are bad coders," because they are not. Rather, we learn which parts of a distributed system are non-trivial. In the MapReduce paradigm, backup tasks are SUPER important. So regardless of whether Hadoop did a crappy job on them in the first place, scheduling in real-world distributed systems (and especially principled reasoning about lower bounds on the running time of solutions to the problems we face) is an area of research that offers a delightful blend of theory and systems building (engineering?).

I'll take this work over Kurtis's science any day.

Some thoughts from 1) my head and 2) meetings with Matei and Anthony earlier today:
-how does the minimum makespan scheduling algorithm change when you introduce stragglers (multiple copies of some tasks)?
-we need to build a model of the distribution of the jobs and also of the speeds of the nodes
-how does being able to bring up an extra node (e.g. on EC2) affect the model?
-low hanging fruit: fix the dumb 1/3 1/3 1/3 task progress reporting problem in Hadoop.
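
To make the makespan question concrete, here is a rough sketch of the simplest baseline I can think of: greedy longest-first list scheduling onto nodes with different speeds. This is not what Hadoop does and it ignores stragglers and backup copies entirely; the function name, the task sizes, and the node speeds are all made up for illustration.

```python
def greedy_makespan(task_sizes, node_speeds):
    """Greedy longest-first scheduling on nodes of differing speed.

    task_sizes:  amount of work per task (hypothetical units)
    node_speeds: work units per second for each node (hypothetical)
    Returns (makespan, assignment), where assignment maps task index -> node index.
    """
    finish = [0.0] * len(node_speeds)  # time at which each node becomes free
    assignment = {}
    # Consider the biggest tasks first; ties keep their original order.
    for task_id, size in sorted(enumerate(task_sizes), key=lambda t: -t[1]):
        # Pick the node that would complete this task earliest.
        best = min(
            range(len(node_speeds)),
            key=lambda n: finish[n] + size / node_speeds[n],
        )
        finish[best] += size / node_speeds[best]
        assignment[task_id] = best
    return max(finish), assignment


makespan, assignment = greedy_makespan([4, 3, 2, 1], [2.0, 1.0])
print(makespan, assignment)
```

Bringing up an extra node is then just one more entry in node_speeds, which at least gives a way to ask how much the makespan drops. Modeling stragglers would mean letting a task appear on more than one node and taking the earliest finish, which this sketch does not attempt.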

-rework the scheduler to use Chukwa data to get a huge improvement in performance. What is the simplest way to use this data to be smarter?
-we would want to keep an extra field in each "task tracker" object on the jobtracker that just queries the task trackers about
-we would also want to keep a list of features, each with a number that represents the average effect of that feature on task response time.
-are there papers on simple models for using feedback like this in scheduling?
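
One very simple version of the per-feature bookkeeping above, as a sketch only: keep a running average, for each feature, of how much slower (or faster) tasks ran than some baseline when that feature was present. The class, the feature names, and the idea of multiplying the averages together (which naively assumes features act independently) are my assumptions here, not anything Chukwa or the jobtracker provides.

```python
from collections import defaultdict


class FeatureEffects:
    """Running average of response-time ratio per feature (hypothetical model)."""

    def __init__(self):
        self.count = defaultdict(int)
        # Ratio of observed time to baseline time; 1.0 means "no effect".
        self.mean_ratio = defaultdict(lambda: 1.0)

    def observe(self, features, response_time, baseline_time):
        """Record one finished task that ran while `features` were present."""
        ratio = response_time / baseline_time
        for f in features:
            self.count[f] += 1
            # Incremental mean update; the first observation replaces the 1.0 default.
            self.mean_ratio[f] += (ratio - self.mean_ratio[f]) / self.count[f]

    def predicted_slowdown(self, features):
        """Naive estimate: product of per-feature averages (assumes independence)."""
        slowdown = 1.0
        for f in features:
            slowdown *= self.mean_ratio.get(f, 1.0)  # unseen features count as no effect
        return slowdown


effects = FeatureEffects()
effects.observe(["high_disk_io"], response_time=20.0, baseline_time=10.0)
effects.observe(["high_disk_io", "low_memory"], response_time=40.0, baseline_time=10.0)
print(effects.predicted_slowdown(["high_disk_io"]))
```

A scheduler could then prefer the task tracker with the lowest predicted slowdown for the features it is currently reporting. Whether a plain average is good enough, versus something like a regression, is exactly the open question.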


TASKS:
-get Chukwa running on all nodes of the R cluster
-get Chukwa running on EC2
-get Chukwa collecting X-Trace data from Hadoop
