Thursday, February 12, 2009

Google's BigTable

Summary
I'm going to be repeating myself during my summary. I love reading these papers. As I learn more about systems, particularly distributed systems, these papers get even more beautiful.

Sixty Google products use BigTable: Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

Trades off the full relational model for simplicity and dynamic control over layout, plus some transparency into data locality through the schema (locality groups of column families).

Arbitrary string indexes (for column and row names).

Data stored as uninterpreted string blobs.

Choice of serving data from memory or from disk

Data model:
-----------------
(row:string, column:string, time:int64) → string
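The (row:string, column:string, time:int64) -> string map is easy to picture as a nested dictionary with versioned cells. A toy Python sketch of that shape (my own class and names, nothing to do with the real client API):

from collections import defaultdict

class ToyTable:
    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        cell = self.rows[row][column]
        cell.append((timestamp, value))
        cell.sort(reverse=True)              # keep newest version first

    def get(self, row, column, timestamp=None):
        versions = self.rows[row][column]
        if timestamp is None:                # default: latest version
            return versions[0][1] if versions else None
        for ts, value in versions:           # newest first, so first hit wins
            if ts <= timestamp:
                return value
        return None

t = ToyTable()
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.get("com.cnn.www", "contents:"))     # -> <html>v3</html>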

Rows
-Table is sorted by row key
-Every read or write of data under a single row key is atomic
-Rows partitioned into tablets - the mechanism for locality, so pick row keys carefully (e.g. URLs with the hostname reversed; sketch below)
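A quick sketch of why reversed hostnames make good row keys: the table is sorted by row key, so pages from the same domain become adjacent rows and land in the same (or neighboring) tablets. The helper below is my own, echoing the paper's com.cnn.www example:

from urllib.parse import urlparse

def row_key(url):
    # reverse the hostname components so the domain becomes the key prefix
    parts = urlparse(url)
    return ".".join(reversed(parts.hostname.split("."))) + parts.path

print(sorted(map(row_key, [
    "http://maps.google.com/index.html",
    "http://www.cnn.com/world",
    "http://mail.google.com/inbox",
])))
# ['com.cnn.www/world', 'com.google.mail/inbox', 'com.google.maps/index.html']
# -> the two google.com subdomains sort next to each other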

Column families
-access control
-number of col. families should be small
-all cols in same family compressed together
-accounting (disk and memory usage tracked per family)
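Column keys are family:qualifier strings, and the family is the unit of access control, compression, and accounting. A trivial sketch of the split (hypothetical helper, not a real API):

def split_column(column_key):
    family, _, qualifier = column_key.partition(":")
    return family, qualifier

print(split_column("anchor:cnnsi.com"))   # ('anchor', 'cnnsi.com')
print(split_column("contents:"))          # ('contents', '')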

Timestamps
-auto or manual
-garbage collection based on col. family rules
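The paper's per-column-family GC settings are "keep only the last n versions" or "keep only versions written recently enough". A rough sketch of both rules (my own function; real timestamps are int64 microseconds, floats here for brevity):

import time

def garbage_collect(versions, keep_last_n=None, max_age_seconds=None):
    # versions: list of (timestamp, value), newest first
    kept = versions
    if keep_last_n is not None:
        kept = kept[:keep_last_n]
    if max_age_seconds is not None:
        cutoff = time.time() - max_age_seconds
        kept = [(ts, v) for ts, v in kept if ts >= cutoff]
    return kept

now = time.time()
versions = [(now, "v3"), (now - 100, "v2"), (now - 10000, "v1")]
print(garbage_collect(versions, keep_last_n=2))          # keeps v3, v2
print(garbage_collect(versions, max_age_seconds=3600))   # drops the old v1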

Other infrastructure
-GFS
-"cluster management system"
-SSTable - immutable, sorted file format stored in GFS - in-memory block index keeps a lookup to about one disk seek (sketch after this list)
-Chubby - highly available distributed lock service
-Paxos - used inside Chubby to keep its replicas consistent; Chubby in turn guarantees a single active master
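The SSTable idea in one toy: the file is a run of sorted blocks, and a small index (first key of each block -> block offset) is kept in memory, so a point read costs at most one disk seek. A sketch under those assumptions (ToySSTable is invented, not the real format):

import bisect

class ToySSTable:
    def __init__(self, sorted_items, block_size=2):
        # split the sorted key/value pairs into fixed-size "blocks"
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1  # binary search in memory
        if i < 0:
            return None
        for k, v in self.blocks[i]:                   # one "block read" from disk
            if k == key:
                return v
        return None

sst = ToySSTable([("a", 1), ("c", 2), ("f", 3), ("k", 4)])
print(sst.get("f"))   # 3
print(sst.get("z"))   # None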

Architecture:
==========
Client library

Master:
-tracks tablet servers, assigns tablets to them, balances load, garbage collects files in GFS

Tablet servers:
-tablets stored in GFS (which handles replication), organized and served up by tablet servers
-commit log holds redo records
-METADATA table, composed of the root tablet and other METADATA tablets
-root tablet location stored in Chubby (toy sketch below)
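Tablet location is a three-level hierarchy: a Chubby file names the root tablet, the root tablet indexes the METADATA tablets, and the METADATA tablets index user tablets; clients cache whatever they learn. A toy sketch (all names invented; real METADATA row keys encode a table id plus end row key):

chubby = {"/bigtable/root-location": "root-tablet"}

tablets = {
    # root tablet: end row key of each METADATA tablet -> that tablet's name
    "root-tablet": [("lemon", "metadata-tablet-1"), ("zzzz", "metadata-tablet-2")],
    # METADATA tablets: end row key of each user tablet -> tablet server address
    "metadata-tablet-1": [("apple", "ts-17.cluster"), ("lemon", "ts-04.cluster")],
    "metadata-tablet-2": [("zebra", "ts-99.cluster")],
}

def lookup(tablet_name, row_key):
    # entries are sorted by end row key; first range containing row_key wins
    for end_key, value in tablets[tablet_name]:
        if row_key <= end_key:
            return value
    return None

location_cache = {}   # clients cache locations, so most reads skip the hops below

def locate(row_key):
    if row_key in location_cache:
        return location_cache[row_key]
    root = chubby["/bigtable/root-location"]       # hop 1: Chubby file
    meta = lookup(root, row_key)                   # hop 2: root tablet
    server = lookup(meta, row_key)                 # hop 3: METADATA tablet
    location_cache[row_key] = server
    return server

print(locate("banana"))   # ts-04.cluster
print(locate("banana"))   # answered from the client-side cache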

Apps:
Examples of applications that are using it.
800TB for a web crawl

What is the problem? Is the problem real?
Highly scalable data center storage.

What is the solution's main idea (nugget)?
wide applicability, scalability, high performance, and high availability

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
SCALE. Database people did this in the '70s, just ask them. But not at this scale: 500 tablet servers, no doubt petabytes by now.

Not a beautiful new idea, but a beautiful application of old ideas, i.e. engineering: a bulletproof system tackling previously untouched problems and scales.

high level lessons:
--------------------------
-read the lessons section
-scale is hard. A bit flip in a layer-3 router at Yahoo caused a hard-to-debug software-level bug. Solution: build simple systems that do one thing well, then glue them together.
-simplicity
-make files (SSTables) immutable, use garbage collection on them
-related work: they argue other systems (e.g. Chord-style DHTs) take on features BigTable doesn't need, such as churn, heterogeneity, and untrusted participants
-distribute everything possible
-no single point of failure (chubby, master not bottleneck)
-scale = better perf. Amdahl's law: figure out how much you CAN parallelize, then do it (quick arithmetic after this list). The master doesn't hold all metadata.
-measure everything
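Quick Amdahl's law arithmetic behind that point: any serial fraction (say, a master that sits on the data path) caps the speedup, which is why BigTable keeps the master out of reads and writes. Numbers below are purely illustrative:

def speedup(p, n):
    # p = fraction of work that parallelizes, n = number of servers
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.95, 0.99, 0.999):
    print(f"p={p}: {speedup(p, 500):.0f}x on 500 servers")
# -> roughly 19x, 83x, and 334x: shrinking the serial part matters more
#    than adding servers
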
Conclusion:
-great engineering
-compression (lots of duplication with versioning = big wins)
-single commit log per tablet server
-client library caching tablet location
