Below are some suggestions for course projects. Of course, you are free to pursue an idea of your own choosing. You are encouraged to consider projects broadly in networks and distributed systems. I will also consider projects that overlap with other areas such as architecture, theory, operating systems, databases, or artificial intelligence if there is a strong enough connection to some of the course topics we are studying. If you would like more information on any of the suggestions outlined below, please contact me. It is fine for multiple groups to be working on similar projects.
Evaluation of TCP for High-Bandwidth Applications
A number of recent efforts have demonstrated that TCP performs poorly under high bandwidth-delay conditions. Consider the case where two sites wish to communicate with one another at 1Gbps across the wide-area. TCP's severe reaction to a single loss event and the slow rate at which the congestion window ramps back up again make it difficult for TCP to deliver significant bandwidth across relatively large-latency and high bandwidth wide-area links. A number of recent proposals, including TCP Fast and XCP attempt to address this shortcoming of TCP. The goal of this project to evaluate how well these various efforts perform in delivering data across the Internet (we have a number of testbeds avaliable for high-bandwidth experimentation including Grid settings) while remaining responsive to other less resource-intensive TCP flows (e.g., simultaneous HTTP traffic). Based on your experience, can you suggest an alternative protocol or enhancement for efficiently sharing high-bandwidth wide-area links?
PlanetLab Project Suggestions (thanks to Brent Chun for some of these suggestions)Fast and robust wide-area remote execution
Deploying and managing wide-area services on PlanetLab requires coordination and control of geographically dispersed sets of computational resources. Currently, such tasks are performed using the Secure Shell Protocol (SSH) as a base primitive and by layering various shell scripts of top. This approach, while straightforward and practical, suffers across at least two dimensions. First, performance is abysmal. Each connection requires establishment of a new SSH session and the client is required to manage N such sessions, using some combination of serial and parallel connections. Second, control over large sets of remote processes is flaky at best. For example, a C-c will usually result in remote processes being killed; at other times, straggler processes remain. If the SSH client exits in an unexpected way (e.g., sent a SIGKILL, node crashes, etc.), the results are even less predictable.
The dual problems of coordination and control of a set of distributed resources has existed for quite some time in the parallel computing community. A recent example of a system that addresses this problem is the GEXEC remote execution system developed at Caltech. GEXEC provides fast, RSA authenticated remote execution of parallel and distributed jobs on a cluster of machines. Internally, it builds an overlay, an n-ary tree of TCP sockets and threads, which it uses to propagate control and data up and down the tree between the client and the remote nodes running a particular parallel program. By using hierarchical control, it distributes both the work and resource usage associated with large amounts of parallelism across multiple nodes, thereby eliminating problems associated with single node resource limits (e.g., limits on the number of file descriptors on front-end nodes) and achieving significant speedups as well.
The goal of this project is to build a fast and robust wide-area remote execution system for PlanetLab. It will involve leveraging the techniques used in GEXEC for scalability, constructing bandwidth/latency aware overlays (which are important since we cannot assume a fast local area network), and augmenting the approach taken in GEXEC to account for node and network failures to make it robust. Other differences that must be taken into account for the wide-area case include eliminating assumptions of a uniform UID/GID namespace and a shared filesystem such as NFS or AFS. The former might be addressed by bootstrapping a new RSA key pair over an initial SSH session to all nodes. (The cost of this gets amortized over multiple uses of the fast remote execution system.) The latter might be addressed by utilizing a scalable file distribution system (e.g., see related project on this below).Parallel pipelined wide-area file distribution
File distribution to a collection of remote nodes is a common task when developing and deploying wide-area applications. For example, binaries and related files often need to pushed out to nodes in a slice before a service can be instantiated. Today on PlanetLab, this task is accomplished by building on top of the SSH protocol. That is, files are distributed by essentially iterating over a set nodes (potentially in parallel) and using the scp program to push a set of files out. Distributed filesystems simplify this task, but still may suffer from performance problems since a common pattern during development is a write followed immediately by a set of N simultaneous reads. For example, while AFS employs client-side caching for performance, all readers would still need to initially perform a read on a central server hosting the relevant files.
Another way to view this common case of a write followed immediately by multiple reads is to view this as a reliable multicast problem. We can view the node pushing the update out as the sender and the nodes receiving the updated files as receivers in a multicast group. Taking this view, one possible way to approach this problem might be to simply create a multicast group, perform a barrier (i.e., wait until all nodes are ready to receive), and multicast the file to all receivers. Unfortunately, native IP multicast on the Internet is still not widely deployed (and is unlikely to ever be) and, additionally, does not guarantee reliable data delivery. An alternative approach and one which has seen significant recent research activity is to build an application-level multicast overlay, where the participating nodes are responsible for forwarding packets to the other members in the group (e.g., over a tree).
The goal of this project is to build the fastest wide-area file distribution system in existence using an application-level overlay. We ask the question: what is the fastest way to push a sequence of bytes out to N nodes in a wide-area network? Naturally, a system that addresses this question will need to employ various types of parallelism. For example, it might build an overlay out of the nodes as a tree and exploit parallelism across multiple paths and multiple levels in the tree (i.e., pipelining the data). Furthermore, because it operates in the wide-area, it will need to build the overlay to account for available bandwidth between the different nodes in the tree (i.e., it cannot assume a fast local area network). Finally, because all bits in the files being distributed are not always changing, the system may also want to try and exploit sending deltas, as opposed to entire files. A few useful starting points for examining this problem include Overcast, PCP, and the rsync program.
Kernel support for network measurement
One issue with hosting many virtual machines on a single node is the difficulty of accurately timestamping network packet behavior in the face of a large number of process/virtual machine context switches. The goal of this project is to design an interface that allows the operating system (e.g., linux) to timestamp packets using the Pentium's cycle counter when the packet first enters the kernel. A new system (e.g., readwithtime) would add an additional argument to the standard read() system call, filling in the time that the packet was received in the kernel. This interface would allow system developers to more accurately measure network behavior. After building the prototype, carry out some experiments with a number of competing processes to show the ability of indvidual processes to track network characteristics.Self-healing wide area network services
One of the key challenges in maintaining a highly available wide area network service is being able to automatically detect and respond to faults in a timely manner. For example, consider Akamai's content distribution network, which serves web content for such major sites as AOL, Yahoo, and CNN and relies on a overlay network consisting of 13,000+ servers scattered around the world. In such a system, failures in both software and hardware are inevitable due to the vast number of components involved. Manual detection of dead servers and the restarting of new ones becomes infeasible in such a system, not to mention being both time consuming and highly error prone. Instead, what you would really like is to have a system that is capable of both detecting failures in a distributed manner and automatically repairing them without human intervention.
Interestingly, one class of distributed "applications" that have this property and have been shown to be highly resilient in the presence of failures are Internet worms (see Code Redv2 and Nimda). For example, despite widespread filtering of worm replication network traffic at ISPs and the availability of software patches that fix the underlying vulnerabilities exploited by these worms, all of the high profile worms in recent years (e.g., Code Red, Nimda) continue to survive on the Internet. The ability for these worms to automatically replicate themselves on additional machines is a powerful capability. Early research investigation in early 1980s on worm programs that performed useful system functions also realized this, along with the associated risks of self-replicating programs.
"Even worse, however, were the unstable worms, which suddenly seemed to grow out of control..."
-- "The 'Worm' Programs -- Early Experience with a Distributed Computation" (March 1982)
The goal of this project is to leverage the power of self-replicating code and to construct a framework for building highly available wide area network services that other applications can use. Much of the related work in this area has in mainly focused on systems running on local area networks, where networks are fast and network partitions are rare. Furthermore, despite all this work, there has yet to emerge a widely used generic framework that provides automatic fault detection and repair for either local or wide area network services. The goal of this project is to build such a framework by leveraging dynamic slices on PlanetLab and using them as a distributed sandbox for service replication and containment. The end result should be a new PlanetLab service that automatically maintains some desired level of availability for specific PlanetLab services by detecting faults and replicating services on additional nodes as needed. One starting point for this might be George Candea's work on Medusa.
Peer to Peer Networking
Recently, a number of research proposals have addressed the problem of fully distributing content across a large set of nodes spread across the Internet. Sample systems include Chord, CAN, Pastry, and Tapestry. While all of these projects have a number of commonalities, they each also take slightly different approaches to distributing content, achieving locality, and delivering robustness in the face of failure. The goal of this project is to quantitatively evaluate the differences among these various approaches. Under what conditions does each perform differently? Are any of them inherently better than the others? Or can they all be tuned to deliver similar performance, reliability, network stress, etc.? Is one approach simpler than any of the others, making it easier to deploy or more robust to failures. Finally, what are some of the common weaknesses that your study uncovers? How can these systems be improved?
We have built an infrastructure, MACEDON, for quickly building existing and new peer to peer and overlay networks. A number of overlays have already been built in the system, including Chord, Pastry, Scribe, SplitStream, NICE, Bullet, and RanSub. Furthermore, we have built a scalable and accurate large-scale network emaultoin environmnet, ModelNet, that allows the for experimentation with thousands of instances of distributed systems while subject to the hop-by-hop characteristics of a user-specified network topology. Both infrastructures may be used as starting points for your evaluations.
Random Distribution versus Directed Placement
In existing systems, content is essentially randomly distributed across the network based on a hash of a node's IP address and the content to be stored. This approach has the advantage of simplicity and scalability (routing in, e.g., log n steps). However, it can be more difficult to achieve locality and robustness in this scenario. Also, it is unclear how to best distribute/replicate resources in the face of changing access patterns and perhaps per-object availability and performance goals (differentiated QoS). A "more traditional" approach is to track changing access patterns and to deterministically place and replicate content in response to changing network conditions and access patterns. While this approach holds the promise of delivering better performance with reduced resource utilization, it also faces the challenge of significantly increased complexity, the potential need to maintain hard state (making crash recovery and faulty tolerance more difficult), and increased overhead to probe and measure changing network characteristics.
For this project, come up with a scheme to compare and contrast randomized content placement versus directed placement. In the best case, how much state will be maintained system wide for each approach? (Think of this as a calculation of a lower bound.) What are the performance and resource utilization characteristics of each approach? Do you have any recommendations for how such systems should be built?
Performance Tuning of ModelNet
On a single node, ModelNet, is currently able to accurately emulate the characteristics of 125,000 packets/second (varying with the average number of hops through the emulated network) on a 1.4 Ghz Pentium III. The goal of this project is to carry out a detailed performance characterization of the system (including across multiple architectures) to determine where the performance bottlenecks are and to perform performance optimizations where necessary. As part of this effort, your group may decide to undertake a port of the system to Linux, as the current FreeBSD implementation may not perform as well in the network code.
In the next few years, 10 Gb/s Ethernet will become available. While ModelNet is currently able to saturate a 1 Gb/s link (for full-sized packets), what are the fundamental limitations that would keep us from saturating a 10 Gb/s link? Interesting follow on work includes considering how network processors might help with the performance of the system.
Application-Based Evaluation of ModelNet
As described above, ModelNet is a local area emulation environment that captures the characteristics of wide-area networks. The goal is to run and evaluate unmodified applications (e.g., web servers, content distribution networks) and network protocols (e.g., new versions of TCP, HTTP, etc.) on a local cluster of ~100 nodes. The system would take as input a target network topology and configure a set of core machines as traffic shapers, accurately emulating the characteristics of end-to-end paths across the network.
The goal of this project is to evaluate the accuracy of ModelNet for a variety of application scenarios. For instance, we would like to use ModelNet to reproduce the results of published papers evaluating content delivery networks, network congestion protocols, etc. Can we use ModelNet to evaluate the performance of next generation network services without having to rely upon actually exhaustively running working code across hundreds of Internet sites? Can we instead use existing characterization of wide-area network services as input to a system that faithfully reproduces those characteristics and subjects real code to these conditions?
Adaptive Delivery of TCP Acknowledgments
TCP acknowledgments are an important mechanism to allow senders to learn of the receivers advertised window, network congestion, and estimated round trip time to the receiver. However, these ACK packets (often containing no useful data) form a significant source of traffic on the Internet today. In this project, we will consider techniques for reducing the number of required ACK packets from receiver to sender. One possible technique is to adopt a hybrid NACK/ACK approach. Here, receivers will only transmit frequent ACKs under (for example) the following scenarios:
The size of the advertised window has changed by a threshold amount. In this case, an ACK is generated to maintain flow control between sender and receiver.
A packet drop (a hole in the sequence stream) is detected or a threshold amount of time passes. This latter point is to ensure that the sender does not needlessly retransmit a packet. Note that a detected hole in the packet sequence can result in a NACK rather than an ACK.
Another reason to transmit frequent ACKs is to maintain the senders running average and standard deviation of the round trip time measurements (used in retransmission timers for example). The sender can indicate current standard deviation in RTT measurements. As long as the deviation is low, ACKs can be in frequent. As the variation begins to increase (perhaps indicated in the TCP header), more frequent ACKs can be transmitted to properly refresh the retransmission timer.
The above are only a subset of potential techniques for reducing the number of acknowledgements. The performance (in terms of delivered throughput, overall network fairness, number of transmitted packets) can be measured under a range of different scenarios in ns.
An interesting paper describing the above issues is available here.
Sharing Information across Congestion Windows
Consider the case where a browser makes four concurrent connections to a web server under HTTP/1.0 to retrieve multiple elements in a web page. In this scenario, each TCP stream must independently ramp up to the bottleneck bandwidth using slow start. Since most web objects are quite small (4-10 KB in size) many streams do not have the opportunity to ramp up to the bottleneck bandwidth before the connection terminates. This is especially true in the face of the increasing bandwidth-delay products in the Internet. One possible approach around this problem is to share information about the bottleneck bandwidth in the operating system for streams to the same destination. Thus, rather than ramping up to the same bottleneck bandwidth using slow start, the operating system could have each stream set its congestion window to be half the value of any existing streams to that same destination.
Implement this technique in a real operating system (e.g., linux or FreeBSD) and evaluate its performance benefits relative to standard window size management techniques in TCP.
Scalable Receiver Counting in Multicast Environments
One important issue in multicast environments is determining the number of receivers associated with a stream. It is not scalable for each receiver to inform the sender that it is interested in joining the transmission (consider billions of simultaneous receivers). Devise and evaluate (using a network simulator such as ns) a scalable technique for counting the number of receivers associated with a stream. Your approach might employ some form of hierarchy to aggregate multiple receivers in a given region of the network (more generally, in a given region of a multicast distribution tree). Consider the security implications associataed with your approach. What level of trust are you placing in certain nodes? How can you employ a measure of checks and balances to ensure that a misbehaving host does not egregiously affect the final vote tally.
Consider the work on EXPRESS as a starting point for your thinking.
Another related issue with multicast is authentication/authorization of both the senders and receivers associated with a multicast stream. How can receivers prove that they have the right to listen in on a stream (consider pay-per-view)? How can a sender prove that they have the right to transmit (consider a video-conference)?
Ad-Hoc Networking for Wireless Sensor Networks
Consider deployment of mobile sensors with the goal of providing coverage of a geographic region. For example, video cameras mounted on mobile robots may wish to constantly "patrol" a given region. Each sensor has wireless connectivity with limited range and bandwidth. One goal of the system is to avoid any hard state, that is all aspects of the target environment should be learned independently by individual sensors. For instance, initially, no sensor will be aware of the presence, capabilities, or position of any other sensor. Similarly, exact details of the terrain (hills, buildings, etc.) may not be available and must be learned independently. Alternatively, a rough notion of terrain may be available with a directive to pay close attention to regions of particular interest. Variables to consider include available power, range of communication equipment, and number of sensors. There are a broad range of questions to explore in this design space, enough for a number of independent projects. Questions to consider include:
What protocols will be used to communicate aspects of the terrain and presence of other sensors to all participants? How do you maintain group membership information about available sensors. How do the sensors cooperate to carry out their directives (e.g., focus on a particular geographic area) in a decentralized manner?
How can the sensors organize themselves to provide maximal coverage of a given area? Similarly, how can the sensors position themselves to minimize the required communication in an ad-hoc network? Are the goals of maximizing coverage at odds with organizing a fault-tolerant and efficient (both in raw performance and power consumption) ad-hoc network?
How can the system survive and reorganize in the face of addition or subtraction of additional mobile sensors?
What if the system were to become more complicated so that the task of the sensors was not simply to cover a given area but to explore the terrain looking for particular features. For example, in the case of ad-hoc networks for disaster recovery, autonomous mobile sensors may search an area for survivors. Once a survivor has been located, this information must be quickly communicated to a base station so that rescue workers can act on the information.
The above is a specific example of the more general case of communicating the findings of the results to a wired base station. How can the sensors organize themselves to efficiently (again, both in terms of power consumption and time) transmit information to a "sink".
What if certain sensors were optimized for different tasks? For example, some sensors may have more powerful radio transmitters, larger memory capacities. Perhaps such compute stations will serve as a regional cache for intermediate information, or they will perform expensive computation (e.g., pre-processing) to condense the size of information that must be transmitted to the rest of the network (or a base station) saving bandwidth and power. Sensors may also be differentiated, with different sensors able to collect audio, video, or chemical information respectively.
How can the availability of low-power GPS units be leveraged in this scenario? Knowing roughly where they are located will greatly enhance the sensors ability to self-organize for optimal coverage and data dissemination.
What protocols should be utilized in this network of sensors? Is the reliable in-order delivery (and the associated overhead) provided by TCP necessary? What about disseminating information through multicast? Is reliable multicast a requirement? How should it/might it be achieved in this scenario.
The paper, "Negotiation-based Protocols for Disseminating Information in Wireless Sensor Networks" provides a good start at describing some of these issues. Developing reasonable techniques for evaluating the above questions is part of the project. Options include using the Monarch extensions to the popular network simulator, ns (Monarch supports simulation of mobile hosts with wireless communication), or developing your own simulator to evaluate a hypothesis in this environment.
Multicast technology is important because it allows for the efficient delivery of data to a large number of interested listeners. For example, a video broadcast may be efficiently transmitted to millions of listeners (each with different available network bandwidth and screen sizes). Multicast is also useful for event-delivery mechanisms in emerging Ubiquitous Computing programming models. That is, multicast could be used to efficiently transmit the occurrence of an event to arbitrary number of interested listeners distributed across the wide area.
One of the main challenges to building a reliable multicast system is dynamically building an network overlay to perform hierarchical distribution and error recovery (e.g., to prevent NACK explosion). Such virtual networks must typically be adaptive to changing network conditions as well as clients entering and leaving the system. The key question of course is developing the algorithms for dynamically reconfiguration. To simplify, the problem space, some are investigating application-specific multicast traffic. For example, Real Networks uses machines (caches) distributed throughout the network (placed at strategic locations along the network backbone) to intercept user requests for multimedia content. This approach can result in tremendous aggregate bandwidth savings. While the total bandwidth required by individual clients remains constant, the total number of hops (routers) that need to support a high-bandwidth stream can be greatly reduced by choosing a nearby server. Additional benefits include reducing wide-area congestion and reducing the probability that time-sensitive packets will be dropped on the way to the client (assuming a fixed per-hop probability of dropping a packet. A paper on RMX describes some techniques for generalizing this application-level multicast.
The goal of this project is to generalize this model. If data (such as multimedia content) is stored and replicated in the network, how can you redirect client requests to the nearest replica while considering available network performance (the baseline) and current network conditions (congestion). This is related to the general problem of naming wide-area content. Further how can the location of replicas be dynamically re-configured in response to both client access patterns and current network conditions. This is the general problem of dynamically growing multicast distribution trees.
Evaluating Tradeoffs in Power/Computation/Communication for Application Efficiency
Consider a battery-operated handheld device with a wireless network connection. Traditionally, distributed systems must evaluate computation versus communication tradeoffs. As a simple example, an application might compress a large file before transmitting it across the network. Given a fast processor, a reasonable compression ratio, and a slow network connection (e.g., a modem) , such an operation should result in increased efficiency: By trading computation for communication, the application was able to save time. Of course, such a tradeoff must be made dynamically; if the local processor is slow (or heavily loaded) and a fast LAN is available, it may very well be more efficient to transmit the large file directly. There are many examples of such tradeoffs in other fields of computer science. For example, when performing voice recognition, the slow processor in a handheld may be better off performing initial analysis and sending an intermediate form to the server for complete evaluation. What is the correct point for determining the "cut" of an application? When is it worthwhile to package up a request to the server and wait for the response rather than performing the request locally? Note that making such a determination requires "planning under uncertainty" --- using inexact knowledge of distributed sate information to optimize evaluation. More information on such tradeoffs can be found in "Active Names: Programmable Location and Transport of Wide-area resources" and "Agile Application-Aware Adaptation for Mobility".
The goal of this project is to consider tradeoffs similar to the above (adding to the list if possible) while considering power as an additional resource. In addition to a time component to different courses of action, there will now also be a power component. Thus, the system must satisfy the twin, and sometimes contradictory, goals of optimizing time and power, both of which are limited resources. Models are available for CPU power consumption as well as the power consumption of wireless network interfaces. Start by picking and refining a power consumption model for a few typical wireless handhelds (e.g., Palm vs. Windows CE, Proxim vs. Bluetooth vs. WaveLAN network interface). Then build a system to evaluate power/computation/communication tradeoffs for some real application class, such as distilled Web browsing from the Active Names work or voice recognition from the Odyssey work above. Of course, you are free to develop your own scenario. Development need not take place on actual handhelds, though your system should attempt to accurately account for power consumption in making decisions. The Active Names framework (source code available) might be a starting point for conducting this study.