Basic K/V Storage Service & Chaos Monkey

Due Date: April 16th, 11:59PM

For the first checkpoint, you will be implementing a basic key-value storage service (and client to test it!), along with a “chaos monkey” feature that will allow you to simulate various network failures (on top of any failures AWS induces on its own).

We expect all students to use gRPC, Google’s RPC framework that simplifies service definitions and provides automatic message serialization/deserialization. We’re providing some simple skeleton .proto files for this assignment to get you started.

We’ll be providing AWS credits to test your system at a larger scale. We’re hoping to have AWS credits available within the first couple weeks of the course, as we’d like to ensure that all groups are able to run their key-value storage service at scale without issues for the first checkpoint. We aren’t restricting what features of AWS you can use- if you’d prefer to deploy your service via AWS Lambda instead of running VMs manually, go for it!

gRPC Introduction

gRPC is a cross-language RPC framework based on Google’s Protocol Buffers library. Protocol buffers, or protobufs, provide a simple method to serialize/deserialize complex data structures and messages over the network, which is a necessary and often tedious component of distributed systems. gRPC provides an additional wrapper around protobufs to define RPC services with strictly defined input and output types.

We highly recommend that everyone read through the gRPC Quick Start Guide for your language of choice. The starter code provided by the developers is rather robust and can get you started very quickly. Again, feel free to choose any language(s) that you’re comfortable with.

Basic Key-Value Storage

For the first part of this assignment, you’ll be implementing an extremely basic key-value storage service. This is meant to mostly be an exercise to set up your codebase, get used to gRPC, and make sure your group can run things on AWS. Don’t overthink this- the hard part (implementing RAFT) will come later!

Your RPC-based K/V store should demonstrate the following:

- Basic get and put operations, exercised by a test client
- Multiple servers running simultaneously in a cluster, with basic replication of put requests
- A service definition expressed in a gRPC .proto file

We’ll discuss each of these points below.

When implementing your K/V store, just store and retrieve data from whatever data structure you see fit. Don’t worry about persistence- storing everything in RAM is fine.
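Ignoring the gRPC plumbing, the storage layer itself can be as simple as a dictionary guarded by a lock (gRPC handlers may run concurrently). The class and method names below are illustrative, not part of the assignment:

```python
import threading

class InMemoryStore:
    """Minimal in-memory K/V backing store; no persistence."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # RPC handlers may run on multiple threads

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)  # None if the key is absent
```

Your Get/Put RPC handlers would then just call into a store like this and translate the result into a response message.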

K/V Client Operations

In order to test your service, you’ll need a client to give it data! For now, the client should test the two primary operations (get and put). It doesn’t matter how you implement your client- so long as it uses gRPC.

You should think about a variety of ways that you can test various edge cases that might come up, as you’ll need to demonstrate these cases during the final demo and describe them in your technical report. One example could be calling put on one server and then get on another to test your basic replication mechanism (see below).
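The put-on-one-server, get-on-another scenario can be prototyped entirely in-process before you wire up real gRPC calls. The FakeServer class below is a stand-in for your actual servers, with a deliberately naive replication step:

```python
class FakeServer:
    """In-process stand-in for a K/V server with naive replication."""

    def __init__(self):
        self.data = {}
        self.peers = []

    def put(self, key, value):
        self.data[key] = value
        for p in self.peers:       # naive broadcast replication
            p.data[key] = value

    def get(self, key):
        return self.data.get(key)

a, b = FakeServer(), FakeServer()
a.peers, b.peers = [b], [a]
a.put("x", "42")
assert b.get("x") == "42"  # value written to a is readable from b
```

The same shape of test- write through one endpoint, read through another, assert equality- carries over directly once the calls go through gRPC stubs instead.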

Cluster Networking and Replication

After you get a basic client/server working, move to having multiple servers running simultaneously in a cluster. Modify your client to randomly choose different servers to store and retrieve K/V pairs from.

In order to implement basic replication, simply have a server replicate and broadcast any put requests it gets to the other running servers. Don’t worry if the replication fails for now (though don’t allow the server to crash if it does fail!).
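A best-effort broadcast might look like the sketch below. Here `peers` stands in for your generated gRPC client stubs, and ReplicationError stands in for gRPC’s error type (grpc.RpcError in Python); both names are placeholders, not real API:

```python
class ReplicationError(Exception):
    """Stand-in for grpc.RpcError in this sketch."""

def broadcast_put(peers, key, value):
    """Forward a put to every peer; tolerate individual failures."""
    acks = 0
    for peer in peers:
        try:
            peer.put(key, value)
            acks += 1
        except ReplicationError:
            # Best-effort for this checkpoint: skip the failed peer,
            # but never let a replication failure crash this server.
            continue
    return acks
```

Returning the ack count isn’t required for this checkpoint, but it’s a cheap hook for the kind of instrumentation you’ll want when testing later.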

K/V Storage .proto File

Below is a basic .proto file to get you started. You’ll probably want more RPCs than this- feel free to add as many as you like!

kvstore.proto

syntax = "proto3";

package kvstore;

service KeyValueStore {
    rpc Get(GetRequest) returns (GetResponse) {}
    rpc Put(PutRequest) returns (PutResponse) {}
}

// You'll likely need to define more specific return codes than these!
enum ReturnCode {
    SUCCESS = 0;
    FAILURE = 1;
}

message GetRequest {
    string key = 1;
}

message GetResponse {
    string value = 1;
    ReturnCode ret = 2;
}

message PutRequest {
    string key = 1;
    string value = 2;
}

message PutResponse {
    ReturnCode ret = 1;
}

Chaos Monkey

The other major component for this checkpoint is developing a “chaos monkey” to help test your K/V store. Put simply, a chaos monkey is a tool designed to simulate system and network failures. While you won’t have too much use for the chaos monkey for this checkpoint, it will be essential for testing your full implementation of RAFT.

You should implement a chaos monkey within your K/V store server that runs a check every single time an RPC message is received (i.e. within the RPC handler). The check will generate a random value between 0 and 1, and then compare this against the value in a “connectivity matrix”. If the value generated is strictly less than the value in the matrix, the message should be dropped entirely.

Note that dropping a message is different from responding immediately with an error! You might be tempted to return an error code as the response, but this has the effect of letting the sender know immediately that there is a problem, which is rarely the case in a real network.
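The per-RPC check described above fits in a few lines. This is a sketch under the indexing convention from the connectivity matrix section (matrix[src][dst] is the drop probability for messages from node src handled by node dst); should_drop is an illustrative name:

```python
import random

def should_drop(matrix, src, dst, rng=random):
    """True if the receiving node should drop this message.

    rng.random() yields a value in [0, 1); the message is dropped
    when that value is strictly less than matrix[src][dst].
    """
    return rng.random() < matrix[src][dst]
```

When should_drop returns True, the handler should simply never complete the RPC; with a deadline set, the sender eventually observes a timeout, just as it would on a real lossy network.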

The Connectivity Matrix

The connectivity matrix provided to the chaos monkey service represents the chance that a given message between two K/V servers will be dropped. A value of 0 means messages will never be dropped, and a value of 1 means messages will always be dropped. Thus, when node j receives a message from node i, the value (i,j) in the matrix represents the probability that node j will drop the message in its RPC handler.

For example, assume that there are 5 running K/V servers, and the connectivity matrix looks something like this:

0.0 0.1 0.5 1.0 0.0
1.0 0.0 0.4 1.0 0.0
0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 1.0
0.4 0.0 0.1 1.0 0.3

With this matrix, if node 2 receives a message from node 0, there is a 50% chance that node 2 will drop the message. Since row 3 and column 3 contain all 1.0 values, every message sent to or from node 3 will be dropped.

In order to give the connectivity matrix to nodes, you’ll have another .proto file that defines the ChaosMonkey service. The service’s only functions will be to upload an entire connectivity matrix and to modify individual entries.

You should write a second client that uploads a few sample connectivity matrices. Don’t worry about race conditions when sending the matrix to a server- if some servers happen to use slightly stale values for a few milliseconds, that’s okay.
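On the server side, the two RPCs map onto simple state updates. A sketch (ChaosState and its method names are placeholders for whatever your handlers call):

```python
class ChaosState:
    """Server-side holder for the connectivity matrix."""

    def __init__(self, n):
        # Start fully connected: nothing is dropped.
        self.matrix = [[0.0] * n for _ in range(n)]

    def upload_matrix(self, rows):
        # Replace the whole matrix (the UploadMatrix RPC).
        self.matrix = [list(r) for r in rows]

    def update_value(self, row, col, val):
        # Patch a single entry (the UpdateValue RPC).
        self.matrix[row][col] = val
```

Because stale values are explicitly tolerated here, no locking is shown; add it if your language makes torn reads possible.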

Handling Failures

Because messages are going to be dropped by the chaos monkey, you’ll need to have basic exception handlers in your K/V store. For now, they won’t do much- just silently fail and continue operation as normal. Just make sure that your implementation doesn’t crash when a node stops responding!

Chaos Monkey .proto File

Below is a basic .proto file defining the ChaosMonkey service. You’re actively encouraged to add to or modify this file in any way that you want!

chaosmonkey.proto

syntax = "proto3";

package chaosmonkey;

service ChaosMonkey {
    rpc UploadMatrix(ConnMatrix) returns (Status) {}
    rpc UpdateValue(MatValue) returns (Status) {}
}

enum StatusCode {
    OK = 0;
    ERROR = 1;
}

message Status {
    StatusCode ret = 1;
}

message ConnMatrix {
    message MatRow {
        repeated float vals = 1;
    }
    repeated MatRow rows = 1;
}

message MatValue {
    int32 row = 1;
    int32 col = 2;
    float val = 3;
}

Testing and AWS

The vast majority of the technical report will be you reporting on the performance you observe in a variety of scenarios. As such, the bulk of your code won’t actually be your implementation- it will be all of the code you use to test it!

The chaos monkey service is a great start to testing your code, but by itself it won’t sufficiently cover all of the edge cases you’ll need to test. A well-tested system will need to use more complex chaos monkey APIs (or sequence the provided ones in a complex way) in order to test your system well.

In order to test your service at scale, you’ll need to run it with a larger number of machines. AWS provides an excellent way to do this! ACMS has provided us with AWS credits for you to use for the technical report.

The instructions on how to use AWS via your UCSD account can be found at the following URL:

https://go.ucsd.edu/2AO0y2q

Expectations for this Checkpoint

For this checkpoint, we will expect you to show us the following: