Already a classic, this paper introduced the world to MapReduce running at scale on a cluster. Today we are looking toward the next generation of parallel computation frameworks, so it is interesting to examine the problems the paper solves with an eye toward identifying problems that don't fit this model. The authors tackle grep, word count (and the related URL access count), inverted index (and the related inverted hyperlink graph), term vector per host (you don't hear much about this MR job; I wonder whether Apache Nutch has implemented it?), and sort.
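To make the model concrete, here is a minimal single-process sketch of two of those jobs, word count and inverted index. It only illustrates the map, group-by-key, reduce shape; it is not the paper's implementation, and the names `map_reduce`, `wc_map`, `index_map`, and the toy `docs` input are my own assumptions.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Toy in-memory MapReduce: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        # The mapper emits zero or more (intermediate_key, intermediate_value) pairs.
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # The reducer sees each intermediate key once, with all of its values.
    return {k: reducer(k, vs) for k, vs in sorted(groups.items())}


def wc_map(doc_id, text):
    # Word count: emit (word, 1) for every occurrence.
    for word in text.split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


def index_map(doc_id, text):
    # Inverted index: emit (word, doc_id) once per distinct word in the document.
    for word in set(text.split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    return sorted(doc_ids)


# Hypothetical toy corpus of (document id, contents) pairs.
docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]

word_counts = map_reduce(docs, wc_map, wc_reduce)
inverted_index = map_reduce(docs, index_map, index_reduce)
```

Both jobs reuse the same driver and differ only in the mapper and reducer, which is roughly why these problems fit the model so well. A problem that needs several dependent passes over the data would have to be chained as separate jobs.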
Mihai Budiu spoke in class today and presented a wonderful overview of both Dryad and DryadLINQ. Dryad accepts as input an execution flow in the form of an arbitrary DAG, which can be viewed as a series of stages similar to Unix pipes. DryadLINQ is a new programming-language approach to expressing parallel computations that run on Dryad. It features a very SQL-like syntax and integrates seamlessly into Visual Studio .NET.
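As a rough illustration of the "DAG of stages" idea, the sketch below runs a small graph of Python functions in dependency order on a single machine. This is not Dryad's API: `run_dag`, the stage names, and the sample input are all hypothetical, and a real system would place each stage on cluster machines and stream data between them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(stages, deps, source):
    """Run each stage after its upstream stages, passing their outputs as a list.

    stages: stage name -> function taking a list of upstream outputs
    deps:   stage name -> list of upstream stage names
    source: input handed to any stage that has no upstream stages
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = stages[name](upstream if upstream else [source])
    return results


# A pipe-like chain (read -> grep) that then fans out and joins again,
# which a linear Unix pipeline cannot express directly.
stages = {
    "read": lambda ins: ins[0],
    "grep": lambda ins: [line for line in ins[0] if "error" in line],
    "count": lambda ins: len(ins[0]),
    "sample": lambda ins: ins[0][:1],
    "report": lambda ins: {"count": ins[0], "sample": ins[1]},
}
deps = {
    "read": [],
    "grep": ["read"],
    "count": ["grep"],
    "sample": ["grep"],
    "report": ["count", "sample"],
}

results = run_dag(stages, deps, ["error: disk", "ok", "error: net"])
```

The `count` and `sample` stages both consume the output of `grep`, and `report` joins them; that fan-out and join is the part a two-phase map and reduce job has no direct way to express.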