Full Raft Implementation

Google form link

Due Date: May 3rd, 11:59PM

For the second checkpoint, you’ll extend your basic key-value storage service from the first checkpoint to implement Raft. The Raft paper gives a clear specification of the RPCs involved, and these translate naturally into gRPC protobuf service definitions.

At the end of this checkpoint, you’ll have a failure-tolerant key-value storage system with a leader election protocol and log replication. That might sound like a lot, but a strong understanding of the Raft paper should make most of the implementation details clear.

There are two components of Raft that we do not expect you to implement: node join/leave operations, and log compaction. Implementing these introduces a lot of additional complexity into the system that, while useful to evaluate in a real production environment, is out of scope for this project. While we do expect you to test with different numbers of nodes, each test should be run with a static node configuration (i.e. the number of nodes is fixed for the entire duration of a single test).

Checkpoint 1 Checklist

In checkpoint 1, we asked you to create both a simple key-value storage service with trivial replication and a chaos monkey service to induce network failures. Before starting checkpoint 2, review the following key points to make sure everything goes smoothly:

Raft Details

As mentioned above, you do not have to implement dynamic node joining/leaving and log compaction for the technical report. These correspond to sections 6 and 7 of the Raft paper. This means you will be implementing all of the functionality described in sections 5 and 8. The two key components are leader election and log replication.
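One small but essential piece of leader election is the randomized election timeout, which keeps followers from all becoming candidates simultaneously. As an illustrative Python sketch (the 150–300 ms range is the paper’s example; tune it for your own deployment):

```python
import random

# Randomized election timeouts (section 5.2 of the Raft paper) stagger
# candidacy so split votes are rare. The bounds below are the paper's
# example values, not requirements.
ELECTION_TIMEOUT_MS = (150, 300)

def new_election_timeout():
    # Draw a fresh timeout before each wait; reusing one value defeats
    # the purpose of randomization.
    return random.uniform(*ELECTION_TIMEOUT_MS) / 1000.0  # seconds

timeout = new_election_timeout()
```

A follower would reset this timer whenever it receives a valid AppendEntries heartbeat, and start an election only when the timer fires.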

You will need to modify your key/value service to use a persistent, write-ahead log that can be replayed to recover the in-memory key/value data structure after a node failure. How you do this is up to you, but make sure that servers do not share the same log file (that’s cheating!) and that a node’s log survives the node being terminated.
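The core idea can be sketched in a few lines: append each committed entry to a per-node file before applying it, and rebuild the in-memory map from that file on restart. This is a minimal illustration (class and field names are our own, not part of any spec), omitting Raft metadata like terms that your real log entries will need:

```python
import json, os, tempfile

class WriteAheadLog:
    """Minimal append-only log; each entry is one JSON line."""
    def __init__(self, path):
        self.path = path

    def append(self, entry):
        # Flush and fsync so the entry survives a process crash.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Rebuild the in-memory key/value map from the log on restart.
        store = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    store[entry["key"]] = entry["value"]
        return store

# Simulate a crash/restart: a fresh object over the same file recovers state.
path = os.path.join(tempfile.mkdtemp(), "node1.wal")
wal = WriteAheadLog(path)
wal.append({"key": "x", "value": "1"})
wal.append({"key": "x", "value": "2"})
recovered = WriteAheadLog(path).replay()  # {'x': '2'}
```

Note the `fsync`: without it, a write acknowledged to the client could be lost on a crash, which violates the durability you are testing for.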

The remaining details of Raft can be found in the paper, which does an excellent job of outlining the various state parameters and RPC arguments required. We don’t recommend you deviate too much from the specification in the paper, but we do recommend adding additional RPCs for testing purposes.
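As a starting point, the per-node state from Figure 2 of the paper can be written down almost directly. The sketch below uses our own Python names for the paper’s fields; how you persist the first three fields is up to you, but they must survive restarts:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RaftState:
    """Per-node state, following Figure 2 of the Raft paper."""
    # Persistent state on all servers (updated on stable storage
    # before responding to RPCs):
    current_term: int = 0
    voted_for: Optional[int] = None
    log: list = field(default_factory=list)  # entries of (term, command)
    # Volatile state on all servers:
    commit_index: int = 0
    last_applied: int = 0
    # Volatile state on leaders (reinitialized after each election):
    next_index: dict = field(default_factory=dict)   # follower id -> next index to send
    match_index: dict = field(default_factory=dict)  # follower id -> highest replicated index

state = RaftState()
state.log.append((1, ("put", "x", "1")))
```

Keeping this state in one place makes it easier to add the debugging/testing RPCs suggested above, e.g. an RPC that dumps a node’s current term, role, and log length.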

Testing Raft

In order to test Raft sufficiently, you’ll want to verify your system under a variety of conditions. Here are a few (non-exhaustive) examples of non-ideal conditions you should test under:

Client Operations and Raft

Per section 8 of the paper, clients must ensure at-most-once semantics. The leader should also not report to the client that a put operation completed until the key/value pair has been committed. You should not use leases for get requests; instead, use the heartbeat exchange method described in section 8.
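A common way to get at-most-once semantics is for clients to tag each request with a unique id, and for the state machine to remember the last id it applied per client so a retried request returns the cached result instead of executing twice. A minimal sketch (the `SessionTable` name and structure are illustrative, not from the paper):

```python
class SessionTable:
    """Tracks the latest request applied per client so that retries of the
    same request (e.g. after a lost reply) are not executed twice."""
    def __init__(self):
        self.last_applied = {}  # client_id -> (request_id, cached_result)

    def apply(self, client_id, request_id, op):
        seen = self.last_applied.get(client_id)
        if seen is not None and seen[0] == request_id:
            # Duplicate request: return the cached result, do not re-execute.
            return seen[1]
        result = op()  # execute against the state machine
        self.last_applied[client_id] = (request_id, result)
        return result

# A retried increment must not double-apply:
count = {"n": 0}
def bump():
    count["n"] += 1
    return count["n"]

table = SessionTable()
first = table.apply("client-1", 1, bump)
retry = table.apply("client-1", 1, bump)  # same request id -> cached result
```

In a real Raft system the (client_id, request_id) pair travels inside the replicated log entry, so every node’s state machine performs the same deduplication.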

Measuring Performance

While you develop your Raft system, we’d like you to think about measuring performance in the various tests you write. The purpose of the report is to investigate and explain the observed characteristics of your implementation, so it’s very important that you design your tests with measurement in mind.

You should be thinking about not only what to measure, but how to measure the time various operations take within your system. There are a variety of ways that you can gather performance statistics; using multiple methodologies can help provide a breadth of information that will be helpful when writing your technical report.
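One simple methodology is to time individual operations client-side and report percentiles rather than just a mean, since tail latency (e.g. requests that span a leader election) is often the interesting part. A sketch, with illustrative names:

```python
import statistics, time

def measure(op, n=100):
    """Run op() n times and report per-call latency statistics in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()  # monotonic, high-resolution clock
        op()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p95": samples[int(0.95 * (n - 1))],
        "mean": statistics.mean(samples),
    }

# In your tests, op would be a real client call such as a put or get.
stats = measure(lambda: sum(range(1000)))
```

Pair client-side timings like these with server-side logging (e.g. timestamps on log append, replication, and commit) so you can attribute latency to specific phases of the protocol.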