Monday, March 2, 2009

Pig Latin: a not-so-foreign language for data processing

Summary of Pig Latin: a not-so-foreign language for data processing

What is the problem? Is the problem real?
Programmers don't like declarative languages, but MapReduce is too low-level. Pig Latin is meant to be something in between.

What is the solution's main idea (nugget)?
Pig Latin is a language targeted at experienced programmers doing ad hoc (read-only, scan-centric) analysis of extremely large data sets.

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
MapReduce is lower level (Pig Latin compiles into MapReduce jobs), and SQL is too declarative (or so they claim). Hive (which wasn't previous work anyway) aims at compiling SQL into MapReduce, which is distinctly different from Pig Latin's approach.
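
To make the "in between" point concrete, here is a rough Python sketch of the step-by-step dataflow style: each statement names one intermediate result, rather than one declarative query or hand-written map and reduce functions. This is not Pig Latin syntax or any Pig API; the table, the thresholds, and all the names are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: (url, category, pagerank) tuples.
urls = [
    ("a.example", "news", 0.75),
    ("b.example", "news", 0.1),
    ("c.example", "news", 0.25),
    ("d.example", "sports", 0.5),
]

# Step 1: filter out low-pagerank urls.
good_urls = [u for u in urls if u[2] > 0.2]

# Step 2: group the surviving pageranks by category.
groups = defaultdict(list)
for url, category, pagerank in good_urls:
    groups[category].append(pagerank)

# Step 3: keep only categories with at least two urls (arbitrary cutoff).
big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) >= 2}

# Step 4: compute the average pagerank per remaining category.
output = {c: mean(ranks) for c, ranks in big_groups.items()}
```

The same result could be written as a single SQL query with GROUP BY and HAVING. The difference is only that here the programmer spells out the order of the steps.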

I like that their system doesn't need to import the data through a loading stage the way other database systems do. Instead, you give it a function specifying how to get tuples out of the input file. In particular, as they point out, this allows for easier interoperability with the many other big data manipulation tools in use at somewhere like Yahoo.
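
A minimal sketch of that idea in Python, assuming a tab-separated input file: the "loader" just scans the raw file in place and hands parsing off to whatever function the user supplies. The names `load`, `using`, and `tab_separated` are hypothetical and are not Pig's actual interface.

```python
import io
from typing import Callable, Iterable, Iterator, TextIO, Tuple

Row = Tuple[str, ...]

def tab_separated(stream: TextIO) -> Iterator[Row]:
    """Hypothetical deserializer: one tuple per non-empty line, split on tabs."""
    for line in stream:
        line = line.rstrip("\n")
        if line:
            yield tuple(line.split("\t"))

def load(stream: TextIO, using: Callable[[TextIO], Iterable[Row]]) -> Iterator[Row]:
    """Scan the raw input directly; no separate import step, no fixed storage format."""
    return iter(using(stream))

# An in-memory stand-in for a file that some other tool produced.
raw = io.StringIO("alice\tlakers\t3\nbob\tkings\t1\n")
rows = list(load(raw, using=tab_separated))
```

Swapping in a different deserializer is all it takes to read another tool's output format, which is the interoperability win.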

Hard Tradeoffs
  • Declarative vs performance/control

Will it be influential in 10 years?
If by some merciful act of god Yahoo! doesn't implode, yes probably.

Criticisms
It was bold of them to state that programmers prefer to analyze data by writing "imperative scripts or code" as opposed to using a declarative language. They sort of shrug off automated query optimization, which is the subject of an extremely large body of research, as insufficient.
