Tuesday, November 25, 2008


Sweet! A crap ton of feedback on our work and ideas for future work all for free! What a good idea to have the class read this. I agree entirely with Yanpei's observations about our work, that we should be looking for the general problems in MapReduce, and even more generally in distributed systems.

In the spirit of Matei's self criticism, you might find our summaries of the feedback we got from the OSDI reviewers interesting.

I would argue, however, that by looking at what Hadoop got wrong, we learn more than "Hadoop developers are bad coders," because they are not. Rather, we learn what the non-trivial parts of a distributed system are. In the MapReduce paradigm, backup tasks are SUPER important. So regardless of whether Hadoop did a crappy job on them in the first place, research into the complexities of scheduling in real-world distributed systems--and especially principled reasoning about lower bounds on the running time of solutions to the problems we face--is an area of research that offers a delightful blend of theory and systems building (engineering?).

I'll take this work over Kurtis's science any day.

Some thoughts from 1) my head and 2) meetings with Matei and Anthony earlier today:
-how does the minimum makespan scheduling algorithm change when you introduce stragglers (multiple copies of some tasks)?
-we need to build a model of the distribution of the jobs and also of the speeds of the nodes
-how does being able to bring up an extra node (i.e. EC2) affect the model?
-low hanging fruit: fix the dumb 1/3 1/3 1/3 task progress reporting problem in Hadoop.

-rework the scheduler to use Chukwa data to get a huge improvement in performance. What is the simplest way to use this data to be smarter?
-we would want to keep an extra field in each "task tracker" object on the jobtracker that just queries the task trackers about
-we would also want to keep track of a list of features, each with a number assigned to it that represents the average effect of that feature on task response time.
-are there papers on simple models for using feedback like this in scheduling?
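To make the feature idea above concrete, here is a toy sketch (entirely hypothetical: the feature names, the baseline, and the per-feature numbers are all invented, and this is not Hadoop's actual scheduler API) of predicting a task's response time as a baseline plus the average effect of each active feature, then picking the task tracker with the lowest prediction:

```python
# Hypothetical feature-based task-time prediction; all names and numbers
# are invented for illustration.

# Average effect (in seconds) of each feature on task response time,
# as would be learned from monitoring data (e.g. Chukwa); values made up.
feature_effects = {
    "slow_disk": 12.0,
    "concurrent_tasks": 4.5,
    "remote_input_data": 8.0,
}

BASELINE_SECONDS = 30.0  # assumed average task time with no features present

def predict_task_time(tracker_features):
    """Predicted response time = baseline + sum of effects of active features."""
    return BASELINE_SECONDS + sum(
        feature_effects[f] for f in tracker_features if f in feature_effects)

def pick_tracker(trackers):
    """Choose the task tracker with the lowest predicted task time."""
    return min(trackers, key=lambda entry: predict_task_time(entry[1]))

trackers = [
    ("tracker-a", ["slow_disk"]),
    ("tracker-b", ["concurrent_tasks", "remote_input_data"]),
    ("tracker-c", []),
]
print(pick_tracker(trackers)[0])  # tracker-c: no slowdown features
```

The interesting research question is how to learn those per-feature numbers automatically from monitoring data rather than hand-assigning them, which is exactly what the bullet about prior work on feedback in scheduling is asking.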

-get Chukwa running on all nodes of the R cluster
-get Chukwa running on EC2
-get Chukwa collecting X-Trace stuff from Hadoop

More middlebox work (policy aware layer 2)

I saw Dilip give a talk about Policy aware switching at syslunch last year, and have liked the idea ever since.

The premise is: Given that middleboxes are here to stay and account for a huge percentage of network errors because they are so complex to configure, we should use a mechanism that doesn't require middleboxes to be placed on the physical path between end points. PLayer allows for policy to be specified at a centralized location using a standardized API.

I know that we just end up repeating ourselves with each new idea that we address in this class, but on the Yanpei scale, this one seems to be in the earlier stages of progress towards practical or popular adoption, which I guess is good for the authors since it leaves room for more papers as progress is made!

I loved that they included a section on formal analysis, and what I feared would be dense math that I would want to skip over turned out to be simple enough to make sense and add to the paper.

Random thought

I am also taking CS270 Algorithms this semester and think that exposure to nearly all of the theoretical algorithms we are studying would be much more effective if they were presented in the context of the systems in which they are most useful. This is why I love these networking papers so much, in spite of my/our ever-present complaint that these aren't practical enough for Cisco to pick up.

Tuesday, November 18, 2008

DOT - An Architecture for Internet Data Transfer

The idea here is to have a service for data transfer over the Internet. They propose a clean interface for arbitrary data transfer which separates the mechanism for content negotiation from the mechanism for data transfer.

My objection is that such a unified service introduces a new single point of failure into the network.

Not to mention the standard criticism that this is pie in the sky, with no practical, implementable solution that could be used to improve the data transfer mechanisms in use today.

Thursday, November 13, 2008


X-Trace and Andy, sitting in a tree...

Yeah, if I could marry X-Trace, I probably would.

Matei and I worked on X-Tracing Hadoop last year. See the project page here: http://radlab.cs.berkeley.edu/wiki/Projects/X-Trace_on_Hadoop

And right now, I am working with some other X-Trace folks on a CS270 project involving X-Trace graph processing that some of y'all might find interesting. The very earliest version of our project proposal (which will hopefully turn into a full blown project report over the next four weeks) is available for a quick read here: http://docs.google.com/Doc?id=dswvrwj_125hj6fh2dn

Thursday, November 6, 2008


Hey, I've used the i3 cluster, cool!

The idea here is that the original point-to-point design of the Internet is not conducive to certain applications such as multicast, anycast, and mobile hosts. The solution is a generic overlay network on which these networking models can easily be built. This avoids a problem that application-level solutions encounter: the construction of redundant underlying mechanisms across multiple applications.

i3 works by presenting a service which allows sources to send to a sort of virtual host (i.e. a logical identifier), from which receivers can then pull.
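A toy sketch of that rendezvous abstraction (my own simplification: in real i3, triggers are stored on Chord nodes, expire unless refreshed, and the identifiers are long bitstrings rather than names):

```python
# Toy i3-style rendezvous: receivers insert triggers (identifier -> address),
# senders address identifiers, and the overlay forwards data to every
# address registered for that identifier.

from collections import defaultdict

class I3Overlay:
    def __init__(self):
        self.triggers = defaultdict(list)  # identifier -> [receiver addresses]
        self.delivered = []                # (address, data) pairs, for illustration

    def insert_trigger(self, identifier, address):
        """Receiver expresses interest in packets sent to this identifier."""
        self.triggers[identifier].append(address)

    def send(self, identifier, data):
        """Sender addresses a logical identifier, not a host; multicast falls
        out naturally when several receivers share one identifier."""
        for address in self.triggers[identifier]:
            self.delivered.append((address, data))

overlay = I3Overlay()
overlay.insert_trigger("video-feed", "10.0.0.1")
overlay.insert_trigger("video-feed", "10.0.0.2")   # second receiver = multicast
overlay.send("video-feed", "frame-1")
print(overlay.delivered)  # both receivers get frame-1
```

Mobility falls out the same way: a host that moves just re-inserts its trigger with its new address, and senders never notice.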

This paper provides a very thorough look at the idea they are presenting. I really liked that the paper was so broad in its analysis of the idea, implementing several non-trivial applications (such as a mobile host solution and scalable multicast), discussing scalability, security, efficiency, deployment, etc. I also like that it built on another interesting Berkeley research project (Chord).

Wednesday, November 5, 2008

Delegation Oriented Architecture

Middleboxes, though scorned by Internet architect purists, are here to stay. Incrementally adapting the architecture of the Internet to play better with middleboxes can lead to improvements in performance and more flexibility.

They propose two properties:
1) packets contain the destination, which is represented by a unique identifier (EID)
2) hosts can delegate to intermediaries, i.e. middleboxes through which all traffic to the host must flow.

How it works
EIDs are resolved to IP addresses or to other EIDs. This is done by a resolution infrastructure, for which distributed hash tables would work well.
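A sketch of how that resolution might look (the table entries and names here are invented for illustration, not DOA's actual format): since an EID can map either to an IP address or to another EID (a delegated intermediary), resolution follows the chain until it bottoms out at an IP:

```python
# Toy EID resolution: follow EID -> EID delegations until an IP is reached.
# Table contents are invented for illustration.

resolution_table = {
    "eid:host-42": "eid:firewall-7",   # host delegates to a middlebox
    "eid:firewall-7": "192.0.2.10",    # middlebox resolves to an IP address
}

def resolve(eid, table, max_hops=8):
    """Follow EID -> EID mappings until we hit an IP address; bound the hop
    count so a cycle in the table cannot loop forever."""
    current = eid
    for _ in range(max_hops):
        target = table[current]
        if not target.startswith("eid:"):
            return target  # reached an IP address
        current = target
    raise ValueError("resolution chain too long (possible cycle)")

print(resolve("eid:host-42", resolution_table))  # 192.0.2.10
```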

Thoughts on the motivation
I am skeptical of their claim that Internet architect purists typically react to middleboxes with "scorn (because they violate important architectural principles) and dismay (because these violations make the Internet less flexible)."

They present two tenets of the architecture of the Internet which middleboxes violate.
1) all hosts should have a unique identifier (NAT obviously violates this).
2) only the host which was listed as the destination should inspect the packet's higher layer fields

However, I don't think that NAT is as bad as they claim. They argue that NAT "has hindered or halted the spread of newer protocols [such as one I have never heard of, SIP] ... and peer-to-peer systems." To me, these are not compelling reasons to abide by tenet 1, and neither is the argument that Internet architects think that tenet 1 is elegant or pure.

In section 2.2 they present some more reasons why NAT is a pain. They claim that it is hard to set up servers behind a NAT. From my understanding, though, NAT is a policy decision which is usually used to protect the user hosts of an organization, not its servers. Servers can be, should be, and currently are treated specially, often living outside of the NAT walls. I don't agree with the authors that this is a weakness of NAT.

Thoughts on the potential adoption of DOA:
From the paper: "DOA requires no changes to IP or IP routers but does require changes to host and intermediary software." -- Sadly, any modification to the Internet that requires a significant change to host software (except perhaps IPv6, which is going to be do or die soon) will probably never get off the ground. Not only do they propose a new network stack on the end hosts, applications would also need to be ported to use the new interface (which they present as an extension to the Berkeley sockets interface). What is more, DOA would also require the creation of an entirely new resolution system for the EID->IP/EID mappings.

Finally, they themselves admit in 3.4 that they are not offering any new functionality, but rather a new architecture. While interesting from a research perspective, to me this admission symbolizes the reason DOA is not a practical reality.

Monday, November 3, 2008

DNS Caching

This paper used a combination of simulation and analysis of real world traces for a study of the use of DNS and the effects of caching on it.

They discovered that DNS lookups follow a Zipfian distribution and that negative caching is useful.
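To see why a Zipfian workload is good news for caching, here is a quick back-of-the-envelope calculation (mine, not the paper's numbers): if the popularity of the name at rank k is proportional to 1/k, then a cache holding only the most popular names covers a disproportionately large fraction of lookups.

```python
# Illustration of cache effectiveness under a Zipf popularity distribution.

def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: popularity of rank k is proportional to 1/k**s."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]

def hit_rate(num_names, cache_size, s=1.0):
    """Fraction of lookups served by a cache holding the top-ranked names."""
    w = zipf_weights(num_names, s)
    return sum(w[:cache_size]) / sum(w)

# Caching just the 100 most popular of 10,000 names covers over half
# of all lookups when s = 1.
print(round(hit_rate(10_000, 100), 3))
```

The flip side, which the paper's trace analysis gets at, is the long tail: the remaining lookups are spread over so many unpopular names that growing the cache further buys relatively little.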

Development of the DNS

I have set up BIND and MS DNS servers before, but never thought about DNS from a research perspective. Also, before reading this paper, I didn't realize that the HOSTS.TXT file was the primary address resolution mechanism for the Internet before DNS came around.

I thought that the paper was well written, focusing on the parts of the problem that are interesting from a [distributed] systems perspective, and thus it has that sort of timeless feeling that so many of the classic systems papers seem to have, as valuable for its historical perspective as it is for the systems principles it presents.

The paper covers the basics of how DNS works and some of the design trade-offs the engineers faced, and finishes with three discussions: surprises, what worked, and what didn't work. I very much enjoyed this layout.

Some other thoughts/nitpicks:
  • It is interesting that "provide tolerable performance" came last on the list of design assumptions behind DNS.
  • I thought the negative caching surprise was very cool, and it had never occurred to me that DNS used such a mechanism.
  • Their discussion of using datagrams (UDP?) and not TCP was confusing. On the one hand they see much better performance, but on the other hand they realize that they are reinventing much of TCP.
  • In 5.4, they say that when root servers return the name of a host they also pass the IP address of that host for free. Which sort of queries does this apply to? Wouldn't it be a name resolution query, in which case the IP address of the host was passed as input? I know I am simply missing something here.
  • Interesting point in 5.5 about how Administrators might choose short TTLs in a misguided move to get their changes (which are rare) to take effect rapidly.

Thursday, October 30, 2008

Looking Up Data in P2P Systems

This article begins with a discussion of centralized vs. non-centralized indexing structures in P2P systems. Entirely centralized systems, such as Napster, contain a single point of failure. This is the fundamental problem which subsequent systems aimed to solve. The article moves on to a discussion of some such symmetric, structured systems. They contrast Chord's circular skiplist routing scheme with the tree-based routing schemes used by Pastry, Kademlia, and Tapestry, and with CAN's d-dimensional space-based partitioning abstraction (which makes intuitive sense to those who are comfortable thinking spatially).
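To make the structured-lookup idea concrete, here is a minimal consistent-hashing sketch in the spirit of Chord's ring (my own simplification: real Chord finds the successor in O(log n) hops via finger tables, while this just scans a sorted list of node positions):

```python
# Toy consistent hashing: each key is stored at the first node clockwise
# from the key's position on the identifier ring.

import hashlib

RING_BITS = 16

def ring_position(name):
    """Hash a name onto the identifier ring [0, 2**RING_BITS)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % (2 ** RING_BITS)

def successor(key, nodes):
    """The node responsible for a key is the first node at or after the
    key's ring position, wrapping around to the lowest node if needed."""
    ring = sorted((ring_position(n), n) for n in nodes)
    pos = ring_position(key)
    for node_pos, name in ring:
        if node_pos >= pos:
            return name
    return ring[0][1]  # wrapped past the highest node

nodes = ["node-a", "node-b", "node-c", "node-d"]
print(successor("some-file.mp3", nodes))  # deterministically chosen owner
```

The nice property, which all of the surveyed systems share in some form, is that adding or removing one node only remaps the keys in that node's arc of the ring, not the whole keyspace.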

They make a largely unfounded claim that P2P systems are useful for a lot more than illegal music sharing. However, for the most part, P2P systems haven't seen widespread adoption for problems other than illegal file sharing (you nerds in the back of the room are quick to point out that BitTorrent is used to distribute various flavors of Linux, but that was an already solved problem, sit back down). Why?

To rephrase, why do the LARGEST distributed systems in the world, those run by Internet giants like Google and Yahoo, still run software stacks (such as MapReduce, Chubby, etc.) that are engineered to contain a small number of points of failure? I say it is because we have come to realize that the performance and complexity cost of making our systems entirely symmetrical is not justified. Instead we use replication and pseudo-distribution of mission-critical centralized services to provide resiliency.

The only time we find value in a fully symmetric system is when we don't want to or can't have a centralized authority. However for any legitimate business use, we DO have a centralized authority, that being the business itself. This leaves a very narrow market for the justified use of these symmetrical systems, i.e. illegal or ad-hoc file sharing.

Before this, I hadn't read about any DHT's other than Chord, so I appreciated the abbreviated introduction to such a variety of other DHTs.

Tuesday, October 28, 2008

Active Networks

ANTS is a network toolkit that allows any user to inject his own code into the network layer, which applies only to his own packets on the network.

Their architecture uses Active Nodes which recognize and run network code which is embedded in packets (which they call capsules). They do provide a road map for incremental deployment into the current Internet.

This idea shares a lot in common with Overlay Networks, but includes lower level hooks.

Resilient Overlay Networks

Streaming thoughts as I read:

An application-level networking protocol that sits on top of traditional IP and performs better both in the face of failures and in general (in terms of loss rate, latency, and throughput).

-in the abstract they say that RON proved to be highly resilient to 32 outages, each over 30 minutes. However, they don't tell us how the traditional routing protocols (i.e. BGP) fared at handling the same failures. They get to this in the introduction, but I think they should cover it in the abstract.

-Is the performance of a twelve-node RON, representing 132 measured paths, representative? Does this scale?

-plausibility of use in real world when ASs _are_ competing businesses? good for schools or other cooperating ASs i guess.

-they list some applications: overlay vpn and overlay ISP. has anybody built either of these?

-is a 5 percent loss probability improvement in 5 percent of samples significant enough to brag about? Did they see _any_ improvement on the same metric in RON_2 as well? If not, why not?

-they found cases where RON's latency-optimizing and throughput-optimizing path selection algorithms found different paths. Couldn't this just be correlation and not causation? (probably/hopefully addressed in the evaluation section)

-Because routing is done at a higher level, routers can make routing decisions based on metrics which are relevant to the service they are supporting.

-Uses Link-state routing by default.

-RON allows for a lot of policy to be specified as part of the routing policy.

-They implemented RON as a C++ library. They implemented a RON IP forwarder.

-Summarized nicely in table 1, they looked at three things: 1) RON's ability to detect/recover from dropped links, 2) RON's performance improvements in latency, throughput, and loss rate, and 3) the stability of RON routes.

-Re 1, how big of a problem is this really? I am not convinced that this is a real problem with traditional BGP routing. Re 2, the improvements were good, but not great.

-They address problems that RON introduces with NATing, trust issues (RONs require trust), and scalability ("the RON design trades scalability for improved reliability")

Sunday, October 19, 2008

On Power-Law Relationships of the Internet Topology

This paper was a refreshing diversion from many of the protocol and systems building papers we have been reading. They grabbed three graphs, each separated by 6 months, of the Internet from the National Laboratory for Applied Network Research and plotted some simple statistics using log-log scale and noticed a lot of straight lines. Ha.

Axis of each powerlaw:

  1. <node rank (based on # of fanout connections from node), fanout>

  2. <node rank, frequency (bucketing fanout)>

  3. <# of nodes within range, radius of range (# of hops)>

Are #'s 1 and 2 very different?
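For the curious, here is how power-law exponents like these are typically extracted (my own synthetic example, not the paper's data): fit a straight line to the log-log plot and read the exponent off the slope.

```python
# Least-squares fit of log(value) vs. log(rank); on data that exactly obeys
# value = C * rank**e, the fitted slope recovers the exponent e.

import math

def fit_powerlaw(ranks, values):
    """Return (slope, intercept) of the least-squares line through
    (log rank, log value) pairs."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

# Synthetic fanouts obeying fanout = 1000 * rank**(-0.8)
ranks = list(range(1, 101))
fanouts = [1000 * r ** -0.8 for r in ranks]
slope, _ = fit_powerlaw(ranks, fanouts)
print(round(slope, 3))  # -0.8: the exponent falls out of the fit
```

On real topology data the points only approximate a line, which is why the paper reports correlation coefficients alongside each fitted exponent.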

Directed Diffusion

This seems like a vision paper wherein, instead of talking about motes that currently exist, they speak about the "expected capabilities of sensor nodes." Judging by the motes we looked at in class Thursday, their guess was decent in terms of size and performance though the TAG paper focused on motes with much slower processors.

While they start with the observation that the organization of a sensor network can "begin to look like a distributed computing system," they are quick to point out that new techniques will be necessary to account for node failures, temporary outages, and power limitations.

As opposed to TAG's SQL like declarative language, Directed Diffusion uses attribute-value pairs to specify a set of data that the operator is interested in. Interests are diffused into the network and when a mote has a response the answer is sent along the reverse path that the interest came in on.

More simulation, but that makes sense because it sounds like real sensor nets are still a brainchild at this time.

Figure 3 is confusing.

The examples they use throughout the paper (tracking animals) are somewhat less compelling than the pie-in-the-sky applications enumerated at the beginning of the paper. As we discussed in class, the reality of sensor nets has been somewhat less impressive than what the early dreamers envisioned (remote geographic regions or toxic urban locations). Though I believe that some grad students here at Berkeley have done some work with sensor nets tracking sun exposure around Berkeley and in forest settings (I think somebody gave a sys lunch talk about this last semester), so maybe we were being a little premature in class when we bashed this paper.

TAG - for sensornets

This paper explores an in-network solution for aggregation services in a sensornet.

The basic idea is that by thinking hard about the various types of aggregation functions that we find useful we can devise ways of cutting down on the amount of network overhead required to inject a query into the network and collect a response. In particular, we should throw data away at every available opportunity.

I really liked the taxonomy of network queries they presented, but I am fond of taxonomies in general, so I might be biased.

My complaint is the typical one: while sensornets seem to promise the world, more details about a real-world deployment which actually worked would be nice; simulations only take us so far.

I also found the math in 3.2 unnecessarily complex, or under explained, or maybe I was just too lazy to re-read the section a couple of times to really understand what was going on.

Thursday, October 9, 2008

XORs in the Air

This paper discusses the technique of capitalizing on the broadcast nature of wireless networks. Since nodes in a wireless network unavoidably overhear packets that weren't intended for them, we have the opportunity to capitalize on this by XORing multiple packets together and sending a single packet that encodes several packets at once. This is genius!

Monday, October 6, 2008

Ad-hoc routing protocol showdown

This CMU paper from mobicom '98 uses simulation to compare the performance of ad-hoc routing protocols DSDV, TORA, DSR, and AODV.

They had to modify the ns network simulation software to account for node mobility, a more realistic physical layer, wireless NICs (not just wired), and an 802.11 MAC using DCF.

They also implemented each of these protocols and discovered, along the way

Not surprisingly, they found that node mobility has a negative impact on performance. Beyond that, the protocols break into two classes. This is obvious when looking at the shapes of the curves in figure 3. As the paper puts it, "TORA, DSR, and AODV-LL are all on-demand protocols, and their overhead drops as the mobility rate drops. As DSDV-SQ is a largely periodic routing protocol, its overhead is nearly constant with respect to mobility rate."

In a recent conversation with Ion Stoica during a meeting about experimentally reproducing a transport-layer congestion event coined incast, I inquired as to the value of simulating incast vs. reproducing it using a real switch and nodes. Ion was less than enthusiastic about simulation, so we are moving ahead with a real experiment. This made me think about how there seem to be two camps in network research: those who simulate and those who run real experiments. We have read a number of papers from both camps. Here are just a few of the papers we have read which fall into one camp or the other.

Simulation
  • XCP
  • RED
  • CSFQ

Real Experiments
  • Macaw
  • Mechanisms for improving TCP over wireless (Snoop)
  • Roofnet
  • ETX

It seems to me that simulating wireless networks is more difficult than simulating wired networks because there is more variability in their performance. And with wireless networking technology costing so little, what are the reasons that any research group would continue to simulate? One argument that comes to mind is that a good model allows and encourages us to more thoroughly understand what is causing the effects we observe in real life. This sort of reasoning leads to an argument that we should probably both simulate and then verify our simulations on real hardware for all networking research projects (and perhaps all systems projects in general). Of course, this takes extra time and effort which we might not have to spend. Also, I should go back and review the "Modeling Wireless Links for Transport Protocols" paper to remind myself of the arguments they present for modeling/simulating.

Thursday, October 2, 2008

Modeling WLANs

  • Figure 1 and Figure 2 are quite confusing
  • I thought that most of the Internet was still running TCP Reno or Tahoe; these guys claim that almost everybody is using SACK or NewReno
  • They claim that a weakness of using models is lack of reproducibility, but isn't this just a problem of bad writing, i.e. an oversight by the writers? Even when doing real experiments instead of modeling experiments you can omit details about the experimental setup.

Unplanned WLAN mesh networks (Roofnet)

An MIT project to create a reliable mesh network by giving (selling?) wireless access points and roof mountable antennas to people around Cambridge.

The drawback is that people need to install an antenna on their roof, but this is pretty much unavoidable unless nearly everybody in the city had an access point.

While I was living in Madison, I considered signing up for Mad City Broadband, which was a more structured commercial endeavor to provide wireless access. However, in my building, reception was sketchy so I didn't go with MCB. The antenna on the roof could have solved the problem, but I didn't have roof access in my apartment building. This seems like a flaw in the roof mounted antenna solution.

Monday, September 29, 2008

Solving problems in the early days of wireless

Is it weird that this paper from '94 feels like ancient CS history? Maybe it is just in my head, but reading any paper from PARC causes me to assume the paper is likely old and likely influential.

This paper picked up where Phil Karn left off with MACA which was "just an idea" that Karn had introduced to handle the hidden terminal scenario more effectively than CSMA.

They proposed that collisions should be detected and avoided at the receivers, an idea pioneered by Karn with Multiple Access Collision Avoidance ("MACA is an acronym, not a spanish word"). This is in contrast to detecting collision from the sender's perspective, as was done with CSMA (Carrier Sense Multiple Access).

As Karn said, "Much simulation and experimental work needs to be done to answer many questions about how well it [MACA] will really work." Shenker et al. took this to heart and proposed MACA for Wireless, or MACAW, which addressed some of the issues they observed in MACA, such as the use of Binary Exponential Backoff (BEB), which quickly led to unfairness in collision resolution.

They also address the problem for nodes [A] [B] [C]: if C overhears an RTS from B to A, it can't tell whether B is sending, since "C cannot tell if the RTS-CTS exchange was a success and so does not know if B is indeed transmitting data." They chose to solve this by having the sender emit a Data Sending (DS) packet.

They also introduce, with nearly comic academic timing, the Request-for-Request-To-Send (RRTS) packet, which solves another fairness problem of two receivers in range of each other with senders out of range of each other and with one pair transmitting much smaller packets than the other. The RRTS solves an unfairness problem which feels a lot like the problem introduced by BEB.

  • Multiplicative Increase Linear Decrease (MILD) of the backoff interval seems very similar to the concept of Additive Increase Multiplicative Decrease (AIMD) of cwnd in TCP congestion control.
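A side-by-side toy comparison of the two backoff policies (the parameters and the collision trace are invented, not taken from the paper): BEB doubles on collision and snaps back to the minimum on success, while MILD multiplies by 1.5 on collision and only steps down linearly on success, so a just-successful sender doesn't immediately regain an aggressively small backoff.

```python
# Toy comparison of BEB vs. MACAW's MILD backoff; parameters are invented.

BACKOFF_MIN, BACKOFF_MAX = 2, 64

def beb(backoff, collided):
    """Binary Exponential Backoff: double on collision, reset on success."""
    return min(backoff * 2, BACKOFF_MAX) if collided else BACKOFF_MIN

def mild(backoff, collided):
    """Multiplicative Increase, Linear Decrease: gentler recovery."""
    if collided:
        return min(int(backoff * 1.5), BACKOFF_MAX)
    return max(backoff - 1, BACKOFF_MIN)

events = [True, True, True, False, False]  # three collisions, two successes
b = m = BACKOFF_MIN
for collided in events:
    b = beb(b, collided)
    m = mild(m, collided)

print(b, m)  # BEB has snapped back to the minimum; MILD is still backing off
```

This is the mechanism behind the unfairness the paper observed: under BEB, the winner of a collision resets to the minimum backoff and keeps winning.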

Improving TCP for wireless

The authors of A Comparison of Mechanisms for Improving TCP Performance over Wireless Links first build a nice taxonomy to classify 12 schemes used to handle non-congestion-related losses in TCP (see table 1). They then analyze each of the mechanisms in turn.

This paper served as a nice forcing function to solidify some new and formerly vague networking concepts in my head such as SACK, Reno vs. NewReno, and split-connection protocols.

This paper was also an exercise in exploring the sort of trade-offs that were discussed in the classic end-to-end paper we read earlier this semester. In particular, their findings show that one of the largest performance wins came from pushing knowledge from the ends (TCP) down two layers in the network stack into the link layer. I find this paper to represent a wonderful exposition of the final sentence of the end-to-end paper's abstract: "Low level mechanisms to support these functions are justified only as performance enhancements."

Monday, September 22, 2008

Parallelism in the router, i.e. switched backplanes

After reading today's paper on the gigabit switched router, I realized that I should have read it before reading today's other paper, because this paper is more approachable and didn't assume an intimacy with Internet routing technology. Figure 2 even uses a picture to explain the evolution of high performance routing! What is more, I now know what they were talking about in the other paper when they spoke of router scheduling and VOQ.

They are engineering a 16-port router with 2.4Gbps per port, for 38.4Gbps of aggregate capacity.

Shared backplanes could (at the time of writing) support 20Gbps, not enough.

They present iSLIP, a simple effective router scheduling algorithm.

While iSLIP allows for high throughput (100%) of the router's backplane, it does not address delays which might be problematic for latency sensitive applications. Thus, they introduce the intuitive mechanisms of prioritization and speedup (get all packets to the output port as quickly as possible so they can be scheduled).
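Here is a condensed one-iteration sketch of iSLIP-style round-robin matching (my own simplification: the real algorithm runs multiple grant/accept iterations per time slot and only advances pointers for matches made in the first iteration):

```python
# Simplified one-iteration iSLIP-style matching between inputs and outputs.

def islip_round(requests, grant_ptr, accept_ptr, n):
    """requests: set of (input, output) pairs with queued cells (the VOQs).
    Returns the (input, output) pairs matched for this time slot."""
    # Grant phase: each output grants to the requesting input nearest its
    # round-robin grant pointer.
    grants = {}
    for out in range(n):
        wanting = [i for (i, o) in requests if o == out]
        if wanting:
            grants[out] = min(wanting, key=lambda i: (i - grant_ptr[out]) % n)
    # Accept phase: each input accepts the granting output nearest its
    # round-robin accept pointer.
    granting = {}
    for out, inp in grants.items():
        granting.setdefault(inp, []).append(out)
    matches = []
    for inp, outs in granting.items():
        chosen = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
        matches.append((inp, chosen))
        # Advance both pointers one beyond the matched pair.
        accept_ptr[inp] = (chosen + 1) % n
        grant_ptr[chosen] = (inp + 1) % n
    return matches

n = 3
grant_ptr, accept_ptr = [0] * n, [0] * n
requests = {(0, 1), (1, 1), (1, 2), (2, 0)}
print(islip_round(requests, grant_ptr, accept_ptr, n))
```

The pointer-advancing is the clever bit: whoever was just served moves to the back of the round-robin line, which is what desynchronizes the ports and drives throughput toward 100% under uniform traffic.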

Random notes:
  • average internet packet is 250 bytes
Keep this paper on the list! It is very informative for those of us who don't know router architecture coming into the course (which I imagine might be a non-trivial portion of us).

Sunday, September 21, 2008

Blah blah blah, optics, blah blah blah, router.

I nominate Scaling Internet Routers Using Optics as the least fun paper to read so far this semester. They describe in excruciating detail two designs of multi-rack routers (1-hybrid electro-optical, 2-fully optical switch) which achieve 100Tb/s load balanced routing. Their router gives 100% throughput without a scheduler. They address four obstacles to the load balanced switching approach:
  1. Rapidly configuring switch fabric is slow or expensive.  Solution: replace switch fabric with fixed mesh of optical channels
  2. Packets can be mis-sequenced.  Solution: (Full Ordered Frames First) bound the length of queues at middle stage and introduce a re-sequencing buffer at a later stage
  3. Periodic traffic can cause bad throughput.  Solution: same as for problem #1 (replace crossbar switches by fixed optical mesh)
  4. Missing/failed linecards are not easily handled.  Solution: presented in sections 6.2 and 6.3, they present two solutions that allow the switch to be reconfigured to scatter traffic uniformly over all present linecards in the face of some linecard failures.
They are designing for ~3 years out (in 2003); I wonder if the future aligned with their predictions.

The paper was a bit dense on technical/hardware details so I ended up being lost for most of it :-/

Thursday, September 18, 2008

Splitting up Spectrum and wireless transmitter fingerprinting

Solutions for sharing bandwidth
  • Assign a different channel to each AP
  • Load balance
  • Restrict range of each AP
  • Scheduling
Using Channels
  • Least congested channel search (LCCS)- APs talk to each other
  • LCCS is not enough because clients can end up seeing two APs that can't see each other
  • Solution: get feedback from clients, need to do set coloring (conflict free set coloring)
  • Partially overlapping channels are frequently not used (e.g. ch 1 and ch 2 overlap a LOT so we don't use both of them), this causes wasted space (fragmentation of a sort) in our spectrum
  • Solution: Model interference for partial overlap. I-factor == 0 if non overlapping channel, I-factor == 1 if same channel.
  • I-factor = overlap between transmit spectrum mask and receiver's band pass filter profile
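A toy I-factor computation under the definition in the bullets above (the channel shapes here are idealized as flat 22MHz intervals, which is a big simplification of real transmit spectrum masks and receiver filter profiles):

```python
# Toy I-factor: fraction of the transmit band falling inside the receive
# filter's band. 1.0 for the same channel, 0.0 for non-overlapping channels.

def i_factor(tx_band, rx_band):
    lo = max(tx_band[0], rx_band[0])
    hi = min(tx_band[1], rx_band[1])
    overlap = max(0.0, hi - lo)
    return overlap / (tx_band[1] - tx_band[0])

def channel_band(ch):
    """Idealized 802.11b-style channel: 22 MHz wide, centers 5 MHz apart,
    channel 1 centered at 2412 MHz."""
    center = 2412 + 5 * (ch - 1)
    return (center - 11, center + 11)

print(i_factor(channel_band(1), channel_band(1)))  # 1.0: same channel
print(i_factor(channel_band(1), channel_band(2)))  # large: heavy overlap
print(i_factor(channel_band(1), channel_band(6)))  # 0.0: non-overlapping
```

Even this crude version shows the point from the bullets: adjacent channels overlap a LOT, which is why people skip them and leave spectrum fragmented unless partial overlap is modeled explicitly.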

New problem: how to identify who is using our AP with something less spoofable than a MAC address. We could build access control lists out of this that would work much better than MAC-address white-lists.

PAssive RAdiometric Device Identification Scheme (PARADIS)
  • Imperfections in the signal from each card, which are an artifact of manufacturing imperfections, are unique.
  • compute the distance of the "measured signal" from the "ideal signal" along several axes: phase error, magnitude vector, frequency error, SYNC correlation, I/Q origin offset
  • Provides surprisingly accurate identification. They experimented on the Orbit Testbed, which has 130 transmitters all of the same model number with sequential serial numbers; they saw a 0.34% error rate, which is very impressive.
  • Audience Question: what can a well funded adversary do? Can he spoof the hardware signature? Answer: the adversary would have to be able to read your signature (like the defender/security system) 
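As I understand it, the identification step boils down to nearest-neighbor matching in the feature space from the bullets above; here is a sketch with entirely invented fingerprints (real PARADIS extracts these errors from captured frames and uses a trained classifier):

```python
# Toy radiometric identification: attribute a measured signal to the
# transmitter whose stored fingerprint is nearest in Euclidean distance.
# All fingerprint values are invented for illustration.

import math

# fingerprint axes: (phase error, magnitude error, frequency error, I/Q offset)
known_transmitters = {
    "nic-001": (0.10, 0.02, 0.005, 0.01),
    "nic-002": (0.12, 0.01, 0.004, 0.03),
    "nic-003": (0.05, 0.04, 0.009, 0.02),
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(measured):
    """Return the known transmitter with the nearest fingerprint."""
    return min(known_transmitters,
               key=lambda nic: distance(measured, known_transmitters[nic]))

print(identify((0.11, 0.015, 0.0045, 0.025)))  # closest to nic-002
```

The design-against list below (temperature, mobility, NIC aging) is exactly the set of effects that would drift a device's measured point away from its stored fingerprint over time.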
Problems to design against
  • Temperature of NIC
  • Mobility (distance from AP?)
  • NIC aging
So, getting back to how we might use this idea for spectrum sharing, we might want to assign spectrum dynamically, and we would need to identify people more accurately to be fair with our allocation policies (i.e. leases).

Real Time Scheduling
  • Hidden Terminal aware Scheduling (HITS)
  • Hidden terminals can be detected by the higher-level router, which is sending packets that it can tell ahead of time are conflicting even though the routers themselves can't see each other (by definition of the hidden terminal problem). When the higher-order router that has more information notices that two packets will conflict, it can infer that there is a hidden terminal problem, and it can delay the packets just right so that packets that would collide are not sent at exactly the same time.
  • Problem: we are not able to be accurate enough with our timings

The past future of the Internet

Today, we look at a paper written 13 years ago discussing the future of the Internet.

  • When the network can't provide enough reliability or low enough variance, application developers will solve the problem themselves (by using bigger application level buffers, etc.)
... Er, this is an incomplete post... More to come (I blame prelims!).

Sunday, September 14, 2008

A clean slate approach to Congestion Control

We've discussed a lot of papers about congestion control! The most recent one (XCP) represents a refreshing clean slate approach to congestion control. You have to respect the theoretical foundation (control theory) used to justify the design principles.

  • >1 bit of feedback. This was the next logical step to take now that chip performance can handle it.
  • Separation of fairness and throughput policies for independent performance tuning and analysis

Random Early Detection (RED)

The next stop on our tour of the history of the Internet is another congestion avoidance protocol that operates at the gateway level. Random Early Detection sits somewhere between end-point congestion avoidance and Fair Queuing in that it runs on routers, but not at the fine granularity of individual flows. Instead it operates on summed averages of flows.

This work does not introduce Random Early Detection, rather it builds upon existing research on RED, contributing algorithms and simulations of more intelligent parameterizations than existing RED schemes used.

  • Flexible, works with existing transport protocols
  • Focuses on average queue size (using a low pass filter), allowing fluctuation in actual queue size to handle bursty traffic.
  • Avoid global synchronization by randomly choosing nodes to notify of congestion (via dropped packets or Explicit Congestion Notification) weighted by their share of the router's bandwidth.
  • Nice discussion of parameter selection, but (as Matei and I have learned) some real numbers as part of a sensitivity analysis showing the tradeoffs would be nice.

  • "Short packets are just as likely to be dropped as long ones"? I wonder, what would be the implications of adding packet length into the random weighting algorithm?
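To make the low-pass-filter idea concrete, here is a toy sketch of RED's gateway logic. The parameter values are illustrative only (not the paper's recommendations), and I've left out the bandwidth-share weighting mentioned above:

```python
import random

class REDQueue:
    """Sketch of RED's average-queue-size logic (illustrative parameters)."""

    def __init__(self, min_th=5, max_th=15, w_q=0.002, max_p=0.02):
        self.avg = 0.0                       # EWMA of the queue size
        self.min_th, self.max_th = min_th, max_th
        self.w_q, self.max_p = w_q, max_p

    def on_arrival(self, queue_len):
        """Return True if the arriving packet should be dropped (or marked)."""
        # Low-pass filter: avg tracks long-term queue size, ignoring bursts.
        self.avg = (1 - self.w_q) * self.avg + self.w_q * queue_len
        if self.avg < self.min_th:
            return False                     # never drop below min threshold
        if self.avg >= self.max_th:
            return True                      # always drop above max threshold
        # In between: drop probability rises linearly toward max_p.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p
```

Because the drop decision depends on the averaged queue size rather than the instantaneous one, a short burst passes through untouched while sustained congestion triggers early, randomized drops.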

Why it should stay in the syllabus:
Because it fits so nicely into our historical tour de networking.

Thursday, September 11, 2008

Fat Trees for a scalable data center network architecture

Amin Vahdat from UCSD spoke at the UC Berkeley Systems Seminar today; the talk was a SIGCOMM talk. I scribbled some random notes.

Networking is super expensive. Marketing literature brags about being cheap while offering rates of $4000 per port (which is crazy compared to the ~$2000 expense of the actual PC).

They propose something like a Network Of Inexpensive Switches (NOIS - not the official name). Essentially their idea is to use a Fat-Tree!
The idea has been around for 50 years.
Why hasn't it been done?

High level issues with fat-tree that need addressing
  • We have path diversity, but existing routing protocols don't take advantage of that.
  • Cabling explodes!
  • We use 8 port switches throughout, lots and lots of them! Too many?
How is Intel doing this?
  • Routing
    • Localized load balancing switch by switch
    • Utilize our global knowledge for routing! Logically centralized routing brain (replicated, of course). The routers report their statuses to the CS (Central Scheduler).
    • Two level look-up (two tier routing table)

  • Simulations
    • Used Click Modular software router
    • The numbers looked good? Need more workloads. They are building a real prototype

  • Multicast
    • Again, use central scheduling
    • Central scheduler sets up routing tables for multicasting knowledge (eliminating broadcasts to unnecessary "pods" at earliest available point in routes)
    • If multicast pipe grows too large, the CS will know because it is getting reports from all routers, in response to status updates

  • Cabling
    • 14 tons of cabling in a 27,000 node DC!
    • Addressed by using optical. This will cost lots and lots of money, but they think people will buy into it.

  • Power and Cooling
    • I didn't catch this part, I think he might have said "we're working on this still"???

  • Work in progress: they're building it as we speak.

My thoughts
  • This seems a little bit like a bunch of small hypercubes?
  • They just cop out and fall back to optical anyway? Won't this cancel out any price wins? Why can't we do better cabling under the floor? A huge matrix of many many ethernet cable equivalents. 

James Hamilton on blades in data centers

James Hamilton blogged about a data center architecture question I thought others in the class would find interesting. In short: why not to use blades.

Core-Stateless Fair Queuing

Now we move to an optimized version of Fair Queuing, where only the edge routers do the Fair Queuing and pass that information along in the headers of the packets, while the core routers can then be less intelligent (and thus more efficient).
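Here is a toy sketch of the core-router side of that idea, under my own simplification (the real scheme also estimates the fair share dynamically from aggregate arrivals rather than taking it as a given):

```python
import random

def csfq_core_drop(label_rate, fair_rate):
    """Core-router drop decision in a CSFQ-style scheme (sketch).

    Edge routers estimate each flow's arrival rate and stamp it into the
    packet header (label_rate). A core router then drops with probability
    1 - fair_rate/label_rate, which caps every flow's forwarded rate at
    roughly the fair share -- without keeping any per-flow state."""
    p_drop = max(0.0, 1.0 - fair_rate / label_rate)
    return random.random() < p_drop
```

A flow labeled below the fair share is never dropped; a flow labeled at twice the fair share loses about half its packets, bringing its forwarded rate back down to the fair share.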

This represents a natural evolution of the research being done by this set of folks.

Wednesday, September 10, 2008

A third look at congestion avoidance

This time we've moved from the edge into the gateways.

Their idea was great: let's run simulations to show that our modified version of Fair Queueing can outperform traditional congestion control mechanisms.

Much of the writing was comprehensible and well justified. It seems to me that this work was low hanging fruit. That said, their categorization of the flaws in Nagle's version of Fair Queuing, and the solutions they provide, supported by simulations showing large performance gains, were certainly well done and non-trivial.

I feel that they used math which was overly complex and under-explained in the performance section. For the majority of the paper, I was unsure what they meant by fairness with regards to promptness allocation, and a section of dense math didn't help much. I had to wait until the end of the section to get an understandable summary of the problem they were solving (i.e. "This ability to provide lower delay to lower throughput sources completely independent of the window sizes of the FTPs is one of the most important features of fair queuing"). Also, Figure 1 was poorly explained.

Monday, September 8, 2008

A second look at Congestion Avoidance

In his paper Congestion Avoidance and Control, Van Jacobson basically presents the story of most of TCP Tahoe (all except Karn's clamped retransmit backoff and fast retransmit).

To justify the seven new algorithms he introduces, he identifies three things that cause instability in a network:

  1. not reaching equilibrium

  2. more packets entering than leaving

  3. equilibrium not reached because of data path limitations

The first can be handled with the ironically named slow start mechanism (which takes approximately R * log2(W) time, where R = round trip time and W = window size).
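That bound is easy to sanity-check: the window doubles every round trip, so reaching a window of W segments takes about log2(W) RTTs. A trivial sketch (assuming the window starts at one segment and doubles each RTT):

```python
import math

def slow_start_rtts(window, rtt):
    """Approximate time for slow start to open the congestion window to
    `window` segments: the window doubles once per round trip, so it takes
    about ceil(log2(window)) RTTs."""
    return rtt * math.ceil(math.log2(window))
```

For example, opening a 16-segment window over a 100 ms path takes roughly 4 round trips, or 400 ms.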

The second, by better estimating round trip time for resend policies.

The third item I find a bit confusing. I think what he means by "resource limits along the path" is network congestion, so he solves this with congestion avoidance (i.e. additive increase/multiplicative decrease, which he is quick to point out he did not invent, nor even bother to justify in this paper).

He points out that dropped packets, as detected by timeouts, are enough for congestion detection, making explicit notification of congestion (as is described in today's other paper) unnecessary.

Like most of the papers we have read so far, this work is worth reading because it shaped the Internet, which in turn continues to shape the world as we know it.

Friday, September 5, 2008

Why to increase additively and decrease multiplicatively

In section 1.2 of this paper on congestion avoidance in computer networks, the authors present a summarized overview of the types of decision making that a system can incorporate, ranging from centralized to decentralized. They justify their decision to use a decentralized system.

I appreciated the way they articulate the classification of algorithms for managing distributed resources (centralized vs. decentralized). It is essentially one of the questions we are wrestling with in the Chukwa project, and also the RAD Lab project. I intend to read some of the papers they reference when they describe this spectrum.

Zooming out, the overall thing that makes this paper amazing is the way they formalize, quantify, and visualize their algorithms.

Every systems paper should attempt to emulate the methodology found in this one. They begin by clearly articulating the problem being solved (Section 1.4) and defining the solution space (imagine a matrix with three rows: additive/multiplicative/non-linear, and two columns: increase/decrease). They then construct a concrete mathematical model to quantify the performance impacts of the various points in the design space (the 2D vector representation of the Dynamics (2.1) was genius), which of course leads to an elegant, intuitive solution.
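You can watch the 2D dynamics converge to fairness in a few lines. This is a toy two-flow simulation of additive increase/multiplicative decrease; the step sizes and link capacity are made up:

```python
def aimd_step(x1, x2, capacity, alpha=1.0, beta=0.5):
    """One AIMD step for two flows sharing one link.
    Over capacity: both back off multiplicatively, which shrinks the gap
    between them. Under capacity: both probe upward additively, which
    preserves the gap. Repeating drives the allocation toward the
    fairness line x1 == x2 -- the paper's key geometric result."""
    if x1 + x2 > capacity:
        return beta * x1, beta * x2
    return x1 + alpha, x2 + alpha

def simulate(x1, x2, capacity, steps=200):
    for _ in range(steps):
        x1, x2 = aimd_step(x1, x2, capacity)
    return x1, x2
```

Starting from a wildly unfair allocation like (0, 80) on a capacity-100 link, the two rates end up nearly equal after a couple hundred steps, while the total stays near capacity.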

This paper seems to be missing empirical experimentation showing that the proposed optimal solution actually works in reality.

Thursday, September 4, 2008

Refocusing Internet Research?

I believe that the motivation for our reading Interdomain Internet Routing (Lecture 4) was to gain insight into the functional aspects of the BGP protocol. A little bit of searching shows that most of what has been published on BGP is of a technical nature.

As the beginning of this reading notes, however, the structure of the Internet--and in turn BGP--is a function of the business model on which it is built. While everyone who writes about BGP is quick to point this fact out, they then proceed to tackle only how the technologies facilitate the complex business relationships that exist among the various tiers of ISPs.

I wonder, is there a group of economists and business visionaries that are just as excited about the future of the internet, but perceive the problem from the inverse perspective? After all, technology isn't driving the internet, money is. The technology follows the economics. A lot has changed since the DARPA paper on TCP/IP was written. In those days, the objective of the Internet was to connect people that "needed" to be connected. At that time, researchers were inspired by their national spirit to build something in the name of National Defense. At least that is the romanticized version of the story that I have in my head. Is that how things really were?

Today, we don't even pretend that the Internet is a national icon; we openly admit that the Internet is a conglomeration of independent business entities (which represent their shareholders, not red-blooded Americans everywhere) whose only motivation is to generate a profit.

I bring this up because, in light of this perceived transition (i.e. growing up?) of the Internet early in its formative years, I am surprised at the difficulty I have finding research into the economics that drive the development of the Internet. Rather than founding our guesses for the future of the Internet on things like Moore's Law, Bell's Law, or any other of today's technology trends, shouldn't we be listening more closely to economists (and laws more like Metcalfe's Law), whose job it is to study the nature and behavior of the system that sits at the core of the Internet, which is essentially a business system, not a computer system?

After reading the other paper that Randy assigned for today, I am even more convinced of the sentiments I expressed above.

What made this paper interesting was not the math or the techniques that they used, but rather the big picture question that they were asking, which nobody had even thought to ask up until that point. They asked: what are the relationships between ASes on the Internet? That this question remained unexplored until 2000 only supports my observation above that we were (and continue to be) treating the technology as the primary interest and the business model as the secondary one, while it should be the other way around.

Tuesday, September 2, 2008

OSI vs. TCP/IP stacks

During our discussion in class today, I kept meaning to ask why we have two networking stack specifications. We now know a lot more of TCP/IP's history, but why was it even developed if we already had the OSI model? Did we already have the OSI model?

This webpage (the first search result returned by Google) offers a brief explanation of the differences between the two, and explains that "The adoption of TCP/IP does not conflict with the OSI standards because the two protocol stacks were developed concurrently."

The Wikipedia article on the OSI initiative says that "OSI was an industry effort, attempting to get everyone to agree to common network standards to provide multi-vendor interoperability."

The Wikipedia articles on both protocols (TCP/IP Model and OSI Model) are interesting further reading for anybody who is as curious as I was.

Saturday, August 30, 2008

The Design Philosophy of the DARPA Internet Protocols (Clark, MIT)

Briefly summarize the paper (1-2 short paragraphs)
A reflection on the birth of the network and transport layers of the TCP/IP protocol stack. Over 15 years, countless individuals contributed to the development of a network stack that could support continued growth of the ARPANET. Their design was inspired by 7 goals:
  1. Robustness (it was a military project after all, network gateways might be bombed)
  2. Different applications need different service guarantees (i.e. TCP vs. UDP)
  3. Support variety in the Link Layer (not everybody was running a LAN; e.g. there were radio links)
  4. Since different companies/government agencies will own different chunks of the Internet, management must be distributed (not centralized)
  5. Low cost
  6. Adding new hosts must be easy
  7. Accounting must be feasible/easy
This list was sorted by priority with the items near the bottom receiving little attention at the time of the protocol design.

One big lesson to take away from the entire process is that a system that does the job well enough can become popular and be very hard to replace with a future system, even if the latter does the job much better.

Provide relevant background/related material as appropriate (1-2 short paragraphs)
This was a historic paper, most of the related material consists of other historical papers.

Critique the paper and suggest discussion topics (2-3 paragraphs)
At a high level, the paper does a great job distilling design principles out of a huge, complex project. To be nit-picky, I found section 10 confusing and poorly worded. I got lost on the third paragraph of "pg 113." The paper's concluding intuition, that a better building block than the datagram might eventually win out, also turned out to be wrong.

Why or why not keep this paper in syllabus?
Keep it in the syllabus because it contains both a fascinating perspective into the history of the internet/networking as we know it as well as useful system design principles (e.g. references to the end-to-end principle, and insight into the iterative process of systems design).

What issues are left open for future research?
He basically ends the paper with a suggestion for a direction he calls "soft state" which he sees as an improvement over the purely stateless datagram model. To my understanding, this idea didn't represent the direction the Internet went.

What are the important implications of the work?
Uhh... The Internet. (e.g. this blog, the whole entire world, the future of humanity!)

End to End Arguments in Systems Design (Saltzer, MIT)

Briefly summarize the paper (1-2 short paragraphs)
This paper explains and explores the tradeoff between pushing functionality, such as encryption, duplication detection, and robustness mechanisms, into the lower levels of the network stack vs. implementing the same functionality at the end hosts of the network connection (i.e. at the application level).

The solution they promote is not simply to push things to the ends, but to be sensitive to the advantages and disadvantages of both scenarios. Each has its limitations. For example, encryption at a low level is not a strong enough guarantee for somebody who does not trust all other users on his machine, since sensitive data would be moved between the process and the encryption point, which would be low in the system stack. On the other hand, encryption done at the application level cannot guarantee that all data going out on the wire is encrypted, which might be desirable from a network policy perspective as a safeguard against "wire tapping" or accidental transmission of sensitive data to an untrusted entity.

Provide relevant background/related material as appropriate (1-2 short paragraphs)
This paper was written as a reflection on the experiences at MIT related to this tradeoff space, including "A Too-Real Example".

Critique the paper and suggest discussion topics (2-3 paragraphs)
This paper is amazing. I believe that my conscience would bother me if I deigned to react with anything but groveling at the feet of its authors.

Why or why not keep this paper in syllabus?
Absolutely keep it in the syllabus, it is a seminal paper that every CS researcher should read. 

What issues are left open for future research?
The principle is a general and useful one. They provide many examples of the trade-off space, but there are countless more that could be enumerated. In fact it is valuable to consider the end-to-end principle any time a network based system is being designed.

What are the important implications of the work?
We had significant discussion of the end-to-end principle during the design phase of Chukwa, the data collection system I helped build at Yahoo over the summer. The end to end principle is taught as a fundamental tool in all systems design work.