Tuesday, March 31, 2009

Erlang, Scala

What is the problem? Is the problem real?

Desire for a functional programming language for distributed systems.

What is the solution's main idea (nugget)?

All of the introductions to the language state early on that Erlang provides many of the features that an OS does such as garbage collection, scheduling and concurrent processes. They are aiming at systems with these properties:
  • Soft real time
  • Millions of lines of code
  • 0% downtime
  • portable (I wonder why?)
  • Lots of parallel processes
  • IPC
  • Clustered environment
  • Hot code loading
  • Robustness via 3 error detection primitives

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)

This is the first programming language I have studied which incorporates so many features of an operating system.

Hard Tradeoffs

  • level of abstraction vs. flexibility

Will it be influential in 10 years?

With millions of lines of code written in it and much of the telecom network depending on it, it is already influential.

Friday, March 27, 2009

DTrace, XTrace, Friday


DTrace: Dynamic Instrumentation of Production Systems

DTrace is a tracing tool with the following features:
-entirely dynamic, has zero cost when not turned on
-centralized, one stop shop for all kernel and user level tracing
-built into Solaris, and subsequently other systems.
-guarantees no silent trace data corruption
-arbitrary actions can be used to enable probes
-predicates (filtering) at the source
-virtualized per consumer (so no weird sharing of probes)

->Providers are kernel modules responsible for instrumenting the system. The framework calls them, and they call back into the DTrace kernel framework to set up probes.
->Probes are created and then advertised to consumers, who can enable them, which creates an enabling control block (ECB). If more than one consumer enables a probe, each one has an ECB associated with it; when the provider fires that probe, the DTrace framework loops through all of its ECBs and runs each one independently (this is where per-ECB predicates kick in and filter which probe firings are visible to each consumer).
->Each ECB has a list of actions for the probe. The actions are run and can store data in the per-CPU memory buffer that each consumer has associated with it. Actions are limited to restrict the amount of overhead and damage they can impose (e.g. they can't change registers). An ECB always stores the same amount of data in the consumer's buffer. Actions can fail because the buffer is full, in which case the data is simply dropped and the buffer's drop count is incremented.
->Two buffers are kept per consumer per CPU, one active, one inactive. Interrupts are disabled during both probe processing and buffer switching.
->Actions and predicates are specified in a simple RISC-like instruction set, typically written in a C-like language called D.
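The ECB dispatch described above can be sketched as a toy model. This is a minimal illustration of the per-consumer loop (predicate filtering at the source, fixed-size records, drop-and-count on a full buffer), not the real DTrace API; all class and field names here are my own.

```python
# Toy model of DTrace-style enabling control blocks (ECBs): when a probe
# fires, the framework walks every ECB enabled on it, applies that
# consumer's predicate, runs its actions, and writes a fixed-size record
# into that consumer's bounded buffer (dropping and counting on overflow).

class Consumer:
    def __init__(self, capacity):
        self.buffer = []          # stand-in for the per-CPU memory buffer
        self.capacity = capacity
        self.drops = 0            # incremented when a record is dropped

class ECB:
    def __init__(self, consumer, predicate, actions, record_size):
        self.consumer = consumer
        self.predicate = predicate    # filter evaluated at the source
        self.actions = actions        # each action returns data to record
        self.record_size = record_size  # an ECB always stores the same amount

def fire_probe(ecbs, probe_args):
    for ecb in ecbs:                  # each ECB is processed independently
        if not ecb.predicate(probe_args):
            continue                  # filtered before any data is stored
        record = [action(probe_args) for action in ecb.actions]
        c = ecb.consumer
        if len(c.buffer) + ecb.record_size > c.capacity:
            c.drops += 1              # buffer full: drop silently, never block
        else:
            c.buffer.append(record)
```

Two consumers enabling the same probe see firings independently: one can filter on `pid` while the other records everything, and a full buffer only affects its own consumer.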

What is the problem? Is the problem real?
Today, performance testing tools are aimed at developers, but the performance testing itself is increasingly being done by systems integrators, and tools for this level of perf. testing are scarce.

What is the solution's main idea (nugget)?
A complex but amazingly well engineered low level tracing tool which can capture system calls, function boundaries, kernel synchronization primitives, and many other metrics.

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
They adhere to a zero-overhead (when not enabled) policy; also, this system is implemented more ubiquitously and at a deeper level than any other tracing tool I know of.

Can you write straight assembly code for your actions/predicates? (Probably.) Do permissions ever become an issue? It seems like you need to run this as root; are there scenarios that would require not giving someone root but still giving them (potentially restricted) DTrace functionality?


X-Trace: A Pervasive Network Tracing Framework

This paper presents a path-based tracing tool which requires instrumenting source code to propagate a small constant amount of trace metadata and emit log statements. The log statements can be reconstructed offline into a partial ordering so that the relative timing of many events (e.g. function calls or returns), even events collected from different architectural layers of the system, can be known deterministically.

What is the problem? Is the problem real?
Doing finger pointing and event correlation in a distributed system is difficult.

What is the solution's main idea (nugget)?
We can augment the data execution path to pass around a small fixed amount of metadata which can be used to correlate events in a system distributed physically or even administratively.
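The metadata-propagation idea can be sketched in a few lines: each event carries a small fixed-size (task id, parent id) pair and emits a report, and an offline pass rebuilds the task tree from the reports. The field names and report format here are my own invention, not X-Trace's actual wire format.

```python
# Sketch of X-Trace-style metadata propagation: every instrumented point
# logs (task_id, op_id, parent_id) and hands its own id down to the next
# layer, so an offline pass can recover the cross-layer task tree.
import itertools

_counter = itertools.count(1)
LOG = []  # stand-in for reports sent to a collection backend

def log_event(label, metadata):
    """metadata is the small fixed-size (task_id, parent_op_id) pair."""
    task_id, parent_id = metadata
    op_id = next(_counter)
    LOG.append((task_id, op_id, parent_id, label))
    return (task_id, op_id)   # propagate this to the next layer down

def rebuild_tree(log):
    """Offline reconstruction: recover parent -> children edges."""
    children = {}
    for task_id, op_id, parent_id, label in log:
        children.setdefault(parent_id, []).append((op_id, label))
    return children
```

Simulating an HTTP layer calling an RPC layer calling a disk layer, the reconstruction recovers the causal chain even though the three "layers" only ever exchanged the small metadata pair.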

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
Previous work did not focus on the cross layer aspect of causal tracing.

One criticism I have is that code instrumentation is a heavyweight tool, especially when it comes to low-level libraries like RPC libs, and they don't present any techniques for automating this painful instrumentation work.
They also cite Pinpoint, which they describe as being very similar to X-Trace. In the paper they say "Our work focuses on recovering the task trees associated with multi-layer protocols, rather than the analysis of those recovered paths." The lack of focus on analysis of trace data is actually a weakness, not something which should be used to differentiate the systems. Also, the analysis of the trace data presupposes the feasible collection of it, which makes me wonder whether the Pinpoint system simply took the collection to be trivial, or just less interesting than the analysis.

Artemis, Scribe, Chukwa


Artemis

A metrics and log aggregation tool built on top of Dryad. It is mostly a debugging tool for Dryad jobs, but is cast as a tool for cluster and cloud services management in general; the developer can choose from thousands of metrics which Microsoft Windows exports for collection and filtering. The idea is to witness the impact of a Dryad job on individual nodes, particularly to spot unexpected behavior. Drawbacks: not open source, built on MS technologies (e.g. .NET for the UI, not web based).


Scribe

An open source project created at Facebook, this system shines in its simplicity (collect logs) and its effective reuse of the lower level Thrift RPC project. Once logs are collected, they can be queried in an ad-hoc fashion using Hive.


Chukwa

A scalable log, metrics, and generic data collection framework built by Yahoo on top of Hadoop MapReduce and HDFS. The idea is to store everything possible in its raw form and do analysis later. This is in contrast to Artemis, which does filtering before collection and stores the results in a much smaller database than Chukwa's HDFS archival store.

Azure, Google Apps Security, AWS

Microsoft Azure

This first link takes us to a list of white papers that have been written about the upcoming Azure cloud service; the second link points us to the most helpful of these white papers, called Introducing the Azure Services Platform. The components of the MS cloud software stack are:
  • Windows Azure - Still vague, but will probably be a virtual machine management layer similar to that provided by Amazon's EC2 service.
  • .NET Services - Access control, service management, application integration.
  • SQL Services - Their storage system (like AWS SimpleDB)
  • Live Services - Ways of integrating with Windows Live applications (their office SaaS)
The details of these are still quite vague, so it is yet to be seen if they will pull it all together, but at a high level they seem to have the key elements covered. I am curious about which level they intend to provide access control at. I'm guessing it is at the cloud consumer level, similar to the mechanisms AWS uses to keep separate users isolated. Amazon doesn't provide much in the way of a Service Bus but I'm not sure what the difference is between MS's Service Bus and classic SOA.

It seems like they are going to face tough decisions about how much of Azure will be integrating current MS technologies (like .NET) and how much of Azure will be new infrastructure. Also, they are already failing to keep things simple and easy to understand the way Amazon does.

Google Security

Comprehensive review of security and vulnerability protections

This is a Google white paper about the obvious security measures that Google employs to protect the massive amounts of personal data they collect from their users. I was very disappointed that they spent the entire paper telling us that they employ all of the security measures that we would expect from even a novice company worried about getting sued by customers over privacy violations.

Amazon's AWS

This White Paper on 'Cloud Architectures' and Best Practices of Amazon's web services begins with a high level description of a web grep application that uses SQS, SimpleDB, EC2, and Hadoop. The second part of the paper contains high level advice for developing cloud applications, including:
  • Use scalable and elastic components (obvious)
  • Use loosely coupled components, use queues (cool)
  • Think parallel (lame)
  • Use elastic components (redundant)
  • Handle failures as the common case (basically Recovery Oriented Computing)
One thing they ignore is the extremely high cost of running a single grep query using their system: 1 query against 10 million documents takes 6 minutes and costs 10 dollars. Compare this to a google search of somewhere between 20 and >60 billion documents which returns in tens of milliseconds and costs 0 dollars.

MapReduce, Dryad, DryadLINQ


Already a classic, this paper introduced the world to MapReduce at scale in the cluster. Today we are looking to the next generation of parallel computation frameworks. It is interesting to look at the problems they solve with an eye towards identifying problems that don't fit this model. They tackle grep, word count (and the related URL access count), inverted index (and the related inverted hyperlink graph), term vector per host (you don't hear all that much about this MR job, I wonder if Apache Nutch has implemented this?), and sort.
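For concreteness, the word count example from the paper can be sketched as a tiny in-memory map/shuffle/reduce pipeline. This shows only the dataflow; none of the distribution, fault tolerance, or combiner machinery that makes the real system interesting.

```python
# Minimal in-memory word count in MapReduce style: map emits (word, 1)
# pairs, shuffle groups values by key, reduce sums each group.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents)))
```

The other jobs listed (grep, inverted index, sort) differ only in the map and reduce functions plugged into this same skeleton.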

Dryad and DryadLINQ

Mihai Budiu spoke in class today and presented a wonderful overview of both Dryad and DryadLINQ. Dryad accepts as input an execution flow in the form of an arbitrary DAG, which can be viewed as a series of stages similar to Unix pipes. DryadLINQ is a new programming language approach to expressing parallel computations which run on Dryad. It features a very SQL like syntax and integrates seamlessly into Visual Studio .NET.

Monday, March 2, 2009

Pig latin: a not-so-foreign language for data processing

Summary of Pig latin: a not-so-foreign language for data processing

What is the problem? Is the problem real?
Programmers don't like declarative languages, but MapReduce is too low level; Pig Latin is something in between.

What is the solution's main idea (nugget)?
Pig Latin is a language targeted at experienced programmers for ad hoc (read-only, scan-centric) analysis of extremely large data sets.

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
MapReduce is lower level (Pig Latin compiles into MR) and SQL is too declarative (or so they claim). Hive (which wasn't previous work anyway) aims at SQL compiled into MR, which is distinctly different from Pig Latin.

I like that their system doesn't need to import the data using a loading stage like other database systems do; instead you give it as input a function specifying how to get tuples out of the input file. In particular, as they point out, this allows for easier interoperability with the many other big-data manipulation tools in use at somewhere like Yahoo.
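The idea behind that load stage can be sketched as follows: the query supplies a deserializer function that turns raw input into tuples on the fly, so no prior import is needed. The `tab_separated` deserializer here is a made-up example standing in for a user-defined function, not Pig's actual built-in.

```python
# Sketch of Pig Latin's "LOAD ... USING fn" idea: a user-supplied
# function deserializes raw lines into tuples lazily, so data from
# arbitrary external tools can be queried without an import step.
def load(lines, deserializer):
    """Lazily produce tuples from raw input."""
    for line in lines:
        yield deserializer(line)

def tab_separated(line):
    """One possible user-defined deserializer: tab-delimited fields."""
    return tuple(line.rstrip("\n").split("\t"))

def scan(lines, deserializer, predicate):
    """A scan-centric query: filter tuples as they are deserialized."""
    return [t for t in load(lines, deserializer) if predicate(t)]
```

Because deserialization is just a function, swapping file formats means swapping one argument, which is exactly the interoperability benefit noted above.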

Hard Tradeoffs
  • Declarative vs performance/control

Will it be influential in 10 years?
If by some merciful act of god Yahoo! doesn't implode, yes probably.

It was bold of them to state that programmers prefer to analyze data by writing "imperative scripts or code" as opposed to using a declarative language. They sort of shrug off automated query optimization, which constitutes an extremely large body of research, as insufficient.

Wednesday, February 25, 2009


Notes on Tyson's presentation about PNUTS:
  • complaint: not enough control over replication
  • predicate to system, apply to the system before
  • hash or sorted tables
  • no foreign key constraints (not normalized?)
  • selection predicates, and projection from tables
  • no joins
  • per record timeline consistency, all replicas apply updates to a record in the same order. all updates go to a master, which moves around for performance
  • Tyson likes that they kept just the right pieces of the db to support their workload (reading one record at a time)
  • no fine control over replication factor
  • rethink isolation?

Notes on Michael's SCADS talk:
  • Scale independence (borrowed idea from db papers, data independence)
    • No changes to app as you scale
    • latency stays constant
    • cost per user stays constant
  • auto scale up/down
  • declarative performance consistency
  • perf. safe query language (cardinality constraints)
  • give developers axis that they understand:
    • performance SLA
    • read consistency (freshness)
    • write consistency
    • durability SLA
    • Session guarantees (monotonic reads, read your own writes)

Thursday, February 19, 2009


Amazon's Dynamo

What is the problem? Is the problem real?
Building a highly available, partition-tolerant, high-performance key-value store.

What is the solution's main idea (nugget)?
Use various well known distributed systems techniques to build a flexible storage solution.
-consistent hashing (sounds like Chord)
-object versioning (sounds like BigTable)
-"sloppy quorum" (from talk) for consistency
-"decentralized replica synchronization protocol"

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
Their demands are slightly different from those of the related work (e.g. Chord), most of which they build on. They do it at scale, it really works, and it is in production!

How is this different from S3? - provides control over availability/performance tradeoff

Thursday, February 12, 2009

Google's BigTable

I'm going to be repeating myself during my summary. I love reading these papers. As I learn more about systems, particularly distributed systems, these papers get even more beautiful.

Sixty Google products use BigTable: Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

They trade off the full relational model for simplicity and dynamic control, plus some transparency into data locality through schemas (column groups?).

Arbitrary string indexes (for column and row names).

Data stored as uninterpreted string blobs.

Choice of storing on memory or disk

Data model:
(row:string, column:string, time:int64) → string

-Table is sorted by row key
-Every read or write of data under a single row key is atomic
-Rows partitioned into tablets - mechanism of getting locality so pick good names (e.g. urls in reverse order)
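The data model and row-key properties above can be captured in a toy in-memory version: (row, column, timestamp) → string, with atomic single-row mutation, newest-first versions, and row-sorted scans. This is hugely simplified (no tablets, SSTables, or persistence), and the version-GC parameter is illustrative.

```python
# Toy BigTable data model: rows map columns to timestamped versions of
# uninterpreted string values; scans walk rows in sorted key order,
# which is why choosing good row keys (e.g. reversed URLs) gives locality.
class Table:
    def __init__(self, max_versions=3):
        self.rows = {}                 # row -> {column: [(ts, value), ...]}
        self.max_versions = max_versions

    def put(self, row, column, ts, value):
        """Single-row mutation (atomic per row key in the real system)."""
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((ts, value))
        cells.sort(reverse=True)       # keep newest version first
        del cells[self.max_versions:]  # GC old versions per family rules

    def get(self, row, column):
        """Read the most recent version."""
        cells = self.rows.get(row, {}).get(column, [])
        return cells[0][1] if cells else None

    def scan(self, start_row, end_row):
        """Range scan in row-key order."""
        return [r for r in sorted(self.rows) if start_row <= r < end_row]
```

Writing a page under row key "com.cnn.www" and scanning the "com.cnn" prefix shows how related pages cluster together under the sorted-row layout.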

Column families
-access control
-number of col. families should be small
-all cols in same family compressed together

Timestamps
-auto or manual
-garbage collection of old versions based on col. family rules

Other infrastructure
-"cluster management system"
-SSTable - file format stored in GFS - smart structure - minimizes disk accesses
-Chubby - distributed lock service
-Paxos, for master election

Client library

-track tablet servers

Tablet servers:
-tablets stored in GFS with replication, organized and served up by tablet servers
-commit log for redo logging
-metadata table, composed of a root tablet and others
-root tablet location stored in Chubby

Examples of applications that are using it.
800TB for a web crawl

What is the problem? Is the problem real?
Highly scalable data center storage.

What is the solution's main idea (nugget)?
wide applicability, scalability, high performance, and high availability

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
SCALE. Database people did this in the 70's, just ask them; but not at this scale. 500 tablet servers, no doubt petabytes by now.

Not a beautiful new idea, but a beautiful application of old ideas, i.e. engineering: a bulletproof system tackling previously untouched problems/scales.

high level lessons:
-read the lessons section
-scale is hard: a bit flip in a layer-3 router at Yahoo caused a hard-to-debug software-level bug. Solution: build simple systems that do one thing well, and then glue them together.
-make files (SSTables) immutable, use garbage collection on them
-related work: they say other systems (e.g. Chord) included too many features, such as handling churn, heterogeneity, and trust.
-distribute everything possible
-no single point of failure (chubby, master not bottleneck)
-scale = better perf. Amdahl's law: figure out how much you CAN parallelize, then do it. The master doesn't hold all metadata.
-measure everything
-great engineering
-compression (lots of duplication with versioning = big wins)
-single commit log per tablet server
-client library caching tablet location

Monday, February 9, 2009

All I Need is Scale! - Slide deck review

Notes on eBay slide deck:
-F500 slide on advantages of elasticity. The cost-over-time before/after image looks similar to the Above the Clouds work we've been doing recently; paper to come soon.
-Actually, the "faux PaaS" all look somewhat similar to our "challenges and opportunities" as well.
-Research Challenges slide is a good source for project ideas for this cloud class! PAR Lab folks are tackling the multicore via programming languages approach. I particularly liked the n-resource, m-constraint and model-driven architecture challenges.

Sunday, February 8, 2009

eBay's scaling Odyssey

Notes and thoughts as I read through the assigned eBay slides:

The first slide claims that they "roll 100,000+ lines of code every two weeks." That is pretty incredible, considering how little the basic functionality of eBay seems to have changed in the last 5 years. Where is all of that code going? Is it all under-the-covers changes used to support better scaling, or is it code to support new features that I just don't use?

I was using eBay last week, and it defaulted to a "new" interface which did not allow me to use the back button (in Safari) to return to the page of search results after viewing an item. How could they get that wrong!? Sorry, just me venting about how, in spite of all the energy and thought they clearly put into their service, in my opinion eBay seems to get better and worse over time, not simply better, as you would expect.

Interesting point about partitioning session state from the application level, since user info and state must span a diverse array of existing and future applications within the eBay framework (e.g. eBay and eBay Motors).

They claim to support multiple levels of SLAs as a function of the "given logical consumer." I wonder exactly what differentiates one logical consumer from another, e.g. do more active eBay users get better response times? Or are the users they are referring to here internal to eBay?

I like that they use a different consistency model for different elements of their systems. I doubt that a single storage system supports the diversity of their consistency requirements; I imagine they use a different system for each.

I like the sound of a Model-centric architecture because it sounds a lot like RAD Lab rhetoric.

I had to look up what an Erlang-B is.

I especially liked the Ronald Coase (Nobel Laureate) quote on the third-to-last slide regarding whether eBay will be moving into a "cloud". It has special relevance to the course on cloud computing for which this reading was assigned, because it speaks to the economic justifications for the existence of corporations, which are very similar to the economic justifications for the existence of public and private computing clouds. eBay might find themselves moving in the direction of what we at Berkeley have dubbed a "private cloud".

Wednesday, February 4, 2009

DCell: Recursive Data Center Networking

The main idea here is to capitalize more on the recursive nature of networks to provide better performance and fault tolerance.

Policy Aware Switching

I blogged here about the policy aware switching paper last semester when we read it for Randy's networking class.

A Scalable Commodity Data Center Network Architecture

What is the problem? Is the problem real?
Using high-end interconnects like InfiniBand and Myrinet is too expensive, so we want to build datacenter networks out of commodity switches and routers.

What is the solution’s main idea (nugget)?
They design a fat tree network architecture for the datacenter entirely out of low cost commodity components with the scaling, economic, and backward compatibility properties that we desire.

Why is solution different from previous work? (Is technology different?, Is workload different?, Is problem different?)
I believe that the use of fat trees in the data center is a very old idea.

Here is a blog post I wrote last semester about a talk on this research that Amin Vahdat came and gave at the weekly Berkeley systems seminar.

Thursday, January 29, 2009

Failure Trends in a Large Disk Drive Population

Summary of the paper:
This seems like something we could (and probably should) reproduce in Chukwa. Also, we definitely should have cited this paper in the Chukwa CCA paper.

They collect the following metrics about disks into BigTable (which is on top of GFS), and then analyze them using Sawzall:
-environmental factors (such as temperatures)
-activity levels
-the Self-Monitoring Analysis and Reporting Technology (SMART) parameters

They brag about the size of their dataset. Concise statement of interesting findings near the beginning of the paper:
-little correlation between failure and temperature or activity levels
-strong correlation between failure and some SMART parameters
-many failures happen without SMART pre-indicators, so SMART data alone is not sufficient for predicting HDD failures.

This makes me wonder if the hard drives in the R cluster nodes are SMART disks. Can we collect this using a Chukwa adaptor?

Their pool was over 100,000 disks. They point out that the definition of "failure" is highly variable, and thus many disks sent back to manufacturers are deemed operational by the manufacturers' tests.

Interesting point about filtering for "spurious" data, which they define simply as data with an impossible value (e.g. negative, or 100,000 degrees F).

Failure and Utilization:
Bucketing their disks into high, medium, and low overall utilization, they observed higher mortality rates during infancy and at the 5 year mark.

SMART data:
In a word, useful. They found a variety of SMART parameters useful for predicting disk failures, including scan errors, relocation errors, offline relocation errors, probation counts, and many others. E.g. "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."

They also, almost unintentionally, provide a subtle commentary on the nature of standardized device protocols such as SMART. Some of the SMART parameter definitions varied by manufacturer, and some of the parameters (e.g. Spin Retries) were never observed, which in such a large population implies that either they measure something unimportant (i.e. which trivially does not occur) or are not actually implemented by the devices.

They tried to predict failures based on SMART data, and found that they could never predict more than half of the failed drives. With more features, or perhaps more complex data (which maybe they have but just didn't feature in the paper), this sounds like a place where someone in the RAD Lab would think to fruitfully apply machine learning.

Overall comments:
They do a great job pointing out and presenting satisfying explanations for the unexpected findings regarding disk failure trends and correlations.

I don't know that the subsection in 3.5.5 on vibration contributed much to the paper.

Data Center Failure Stories

First, this story offers some insights into datacenter failure models. I found it interesting that these were not, for the most part, faults due to ignorance. They were faults that occurred despite a lot of engineering efforts, testing, and close supervision of the production systems. Bus Duct, which I had not heard of before reading this article, seems to be specifically designed for the industrial purpose for which it was being used, so finding somebody to blame for the failure would be difficult in the failure circumstance described.

I think the lesson to be learned is that unpredicted failures are going to occur despite the best engineering efforts, as the second assigned reading confirms:

"81 percent of respondents had experienced a failure in the past five years, and 20 percent had been hit with at least five failures."

One area of emerging research that we are addressing as part of the RAD Lab project is the use of multiple data centers. In the 90's we changed the world with the NOW (Network of Workstations) project. Maybe one of the current decade's memorable Berkeley systems research projects will be NOD (Network of Datacenters).

This reminds me of a topic which we recently explored in discussions about cloud computing here in the RAD Lab: reputation fate sharing. In the horror stories presented in this article, the colocation facility and the government agency running the datacenters that failed were reputation fate sharing with the construction companies they contracted with and the equipment manufacturers they purchased from.

The third article on datacenter failure was actually documentation for a Sun directory service backup strategy. As a document aimed at the administrators of a real system, it is highly pragmatic, with advice about inherent trade-offs such as deciding at which level of the architecture, and in which format, to back up data (in their case, binary replication at the filesystem level vs. other formats).

The issue of which level to build replication and fault tolerance into a system architecture also came up during the discussion I reference above about the use of multiple datacenters for cloud application fault tolerance. Somebody pointed out that one data store technology did intelligent replication at the application level while another punted on the issue, delegating fault tolerance to lower levels of the architecture. The classic end-to-end systems principle comes to mind here, and I believe that there is a direct trade-off between the latency of inter-datacenter replication (and thus consistency in the data store context) and the amount of information you can give the system. For example, if HDFS were to provide cross datacenter replication, in the absence of any hints it would simply replicate all files at the block level (using compression of course), while perhaps only a small subset of those blocks would be sufficient if the application were willing to participate in the replication process.

This third article also presents elements of the sort of preparedness strategies that the first two articles so strongly call for, complete with a "for dummies" decision flow chart.

Saturday, January 24, 2009

Introduction to Network Computing

This website is about as useless as network computing was. Though I don't blame Ion for not being able to find anything good about the failed endeavor.

One thing I did learn was that Apple made a game console/network computer called the Pippin, which apparently, like all things NC, failed miserably.

An experimental time-sharing system - Fernando Corbato

First of all, this paper is old: 1962.

Memory challenges were identified: things about isolation

Programming problems were identified: accounting, supervisor must manage shared IO resources, need good programming tools

Other problems: already thinking about failure
Their design: basically how we would build this today, probably because this was an ancestor of our current time-sharing systems.

some of the most challenging problems:
-HCI, which required considerably more experimentation and evaluation
-multiple terminals communicating with one program
-get some disks (only had tapes)

haha, operators take commands from "the supervisor", i.e. root, e.g. change this tape out

They use a priority-queue scheduling policy, assigning process priority based on the number of words in the program (equation (2)). This seems like a very crude first stab, essentially rating processes by how fast we can get them into and out of memory from tape.

They run a job from queue level l for 2^l quanta of time, and if it doesn't finish they move it to queue l+1. They run everything at queue level l (i.e. until level l is empty), then everything at level l+1 until it is empty, etc. If a job enters a queue at a level lower than the currently operating level, they jump down to that queue level and start the process over from there.

They keep the efficiency of the system above 1/2 by enforcing the policy that the time quantum must be larger than the context-switch overhead time (equation (3)).
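The multi-level queue policy can be sketched as a small simulation. This is my reading of the scheme (quanta growing as 2^l, demotion on an unfinished slice), simplified to a batch of jobs that all arrive at level 0, so the jump-down rule never triggers; it is not the paper's exact equations.

```python
# Toy multi-level scheduling in the spirit of the CTSS paper: a job at
# queue level l runs for 2**l quanta; if it doesn't finish, it is moved
# to level l+1. Each level is drained before moving to the next.
from collections import deque

def schedule(jobs, levels=4):
    """jobs: list of (name, quanta_needed). Returns names in finish order."""
    queues = [deque() for _ in range(levels)]
    for job in jobs:
        queues[0].append(job)          # all jobs enter at the top level here
    finished = []
    level = 0
    while level < levels:
        if not queues[level]:
            level += 1                 # level drained: move to the next one
            continue
        name, need = queues[level].popleft()
        quantum = 2 ** level           # quantum doubles with each level
        if need <= quantum:
            finished.append(name)      # job completes within its slice
        else:                          # unfinished: demote with remaining work
            queues[min(level + 1, levels - 1)].append((name, need - quantum))
    return finished
```

A short job finishes in its first one-quantum slice, while a long job trickles down the levels, which is exactly the bias toward quick interactive jobs the paper is after.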

They also give some numbers for what the parameters would be on the IBM 7090.