Full Raft Implementation
Due Date: May 3rd, 11:59PM
For the second checkpoint, you’ll extend your basic key-value storage service from the first checkpoint to implement Raft. The Raft paper provides a great overview of the various RPCs used to achieve its implementation, which you can easily replicate using gRPC protobuf files.
At the end of this checkpoint, you’ll have a failure-tolerant key-value storage system with a leader election protocol and log replication. That might sound like a lot, but a strong understanding of the Raft paper should make most of the implementation details clear.
There are two components of Raft that we do not expect you to implement: node join/leave operations, and log compaction. Implementing these introduces a lot of additional complexity into the system that, while useful to evaluate in a real production environment, is out of scope for this project. While we do expect you to test with different numbers of nodes, each test should be run with a static node configuration (i.e. the number of nodes is fixed for the entire duration of a single test).
Checkpoint 1 Checklist
In checkpoint 1, we asked you to create both a simple key-value storage service with trivial replication and a chaos monkey service to induce network failures. We want to review a few key points that will ensure that everything goes smoothly for checkpoint 2:
Ensure your configurations define runtime parameters. This notably includes things like key/value sizes, gRPC call timeouts (see below), and the various parameters you’ll need for Raft. Make sure you don’t use static port numbers, as you’ll need to run multiple servers per VM in order to properly test scalability.
Make sure you can easily change between configurations. There’s a lot of ways you can do this, but you need to be able to easily change between running with 5 processes, to 10 processes, etc. Having either a large set of pregenerated configuration files or a testing script that can generate configurations at runtime is fine.
Have a set of robust, programatic client and chaos monkey test cases. The purpose of the technical report is to not only report on the typical performance of your system, but how it performs in a variety of edge cases. For example, the Raft paper itself presents results on how quickly the cluster can recover from a failed leader, which is one concrete example that you should test.
Make RPCs that can inspect internal server state. Methods that can return state information are critical for ensuring correctness of a distributed system, so they’ll be necessary for many of your test cases. Verifying even simple server state can help detect failures, and inspection can help gather statistics that will be invaluable for your technical report.
When “dropping” a message with your chaos monkey, the caller should encounter a timeout. Returning immediately with an error will make it nearly impossible to test some failure modes in Raft. Setting timeouts in gRPC explicitly (per your configuration file) can ensure that drops are emulated correctly.
Clients should retry operations on failure. Operations in distributed systems are expected to sometimes fail, and thus clients must plan for that failure. A dropped client operation should be retried depending on configuration parameters. In tests where failure is expected, client retries are important sources of both information and verification.
Raft Details
As mentioned above, you do not have to implement dynamic node joining/leaving and log compaction for the technical report. These correspond to sections 6 and 7 of the Raft paper. This means you will be implementing all of the functionality described in sections 5 and 8. The two key components are leader election and log replication.
You will need to modify your key/value service to leverage a persistent, write-ahead log that can be used to recover the in-memory key/value data structure in the case of a node failure. How you do this is up to you, but make sure that servers do not share the same log file (that’s cheating!) and that it doesn’t vanish when you terminate a node.
The remaining details of Raft can be found in the paper, which does an excellent job of outlining the various state parameters and RPC arguments required. We don’t recommend you deviate too much from the specification in the paper, but we do recommend adding additional RPCs for testing purposes.
Testing Raft
In order to test Raft sufficiently, you’ll want to ensure that you can verify your system under a variety of conditions. Here are a few (but non-exhaustive) examples of non-ideal conditions you should testing in:
- Network disconnections and partitions. This is the entire point of your chaos monkey! It’s important to ensure that leaders aren’t elected when communication isn’t possible, that client operations don’t complete when log entries can’t be committed, and that two active leaders can’t exist simultaneously across two separate network partitions (though a leader who becomes part of a minority partition might continue to try appending entries–what will happen then?).
- Crashing nodes. Servers crash, and your system needs to work despite this. While we won’t ask you to shut down an entire VM, you should plan on killing and restarting a server instance (even if it’s the leader!) and see how your system performs.
- Concurrent clients. The leader needs to be able to handle concurrent client operations in order to provide any significant performance, and conflicting operations must be resolved in order to ensure system safety.
- “Bad” configuration parameters. The Raft paper details that certain parameters must be set within some thresholds of one another- what happens when this isn’t true?
Client Operations and Raft
Per section 8 of the paper, clients must ensure at-most-once semantics. The leader should also not return to the client that a put
operation was completed until the key/value pair has been committed. You should not use a leasing for get
requests, but use the heartbeat exchange method described in section 8.
Measuring Performance
While you develop your system for RAFT, we’d like you to think about measuring performance in the various tests you write. The purpose of the report is to investigate and explain the observed characteristics of your implementation, so it’s very important that you design your tests with measurement in mind.
You should be thinking about not only what to measure, but how to measure the time various operations take within your system. There are a variety of ways that you can gather performance statistics; using multiple methodologies can help provide a breadth of information that will be helpful when writing your technical report.