HW 5: Remote procedure calls and SurfStore

2018/10/27

Birrell and Nelson

Read the original paper on RPCs:

Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (February 1984), 39-59. DOI=http://dx.doi.org/10.1145/2080.357392

As you read the paper, keep the following questions in mind. Note that you don’t need to answer them in your write-up.

Python tutorial

If you’d like some online training on Python, UCSD has partnered with Pluralsight to offer an optional online course that teaches the fundamentals of Python. Visit https://app.pluralsight.com/paths/skills/python and click the “Log in” link on the upper-right corner of the page. Under the login area, you’ll find a link that says “Sign in with company or school”. Click that, and log in with your UCSD single sign-on credentials. You should then have access to the course.

SurfStore

In this project you are going to create a cloud-based file storage service called SurfStore. SurfStore is a networked file storage application that supports four basic commands:

Multiple clients can concurrently connect to the SurfStore service to access a common, shared set of files. Clients accessing SurfStore “see” a consistent set of updates to files, but SurfStore does not offer any guarantees about operations across files, meaning that it does not support multi-file transactions (such as atomic move).

The SurfStore service is composed of the following two sub-services:

Additionally, you will need to implement a client that can support the four basic commands listed above.

Specification

We now describe the service in more detail.

Basic concepts

Blocks, hashes, and hashlists

A file in SurfStore is broken into an ordered sequence of one or more blocks. Each block is of uniform size (4KB), except for the last block in the file, which may be smaller than 4KB (but must be at least 1 byte large). As an example, consider the following file:

Hash example

The file ‘MyFile.mp4’ is 14,437 bytes long, and the block size is 4KB. The file is broken into blocks b0, b1, b2, and b3 (which is only 2,149 bytes long). For each block, a hash value is generated using the SHA-256 hash function. So for MyFile.mp4, those hashes will be denoted as [h0, h1, h2, h3] in the same order as the blocks. This set of hash values, in order, represents the file, and is referred to as the hashlist. Note that if you are given a block, you can compute its hash by applying the SHA-256 hash function to the block. This also means that if you change data in a block the hash value will change as a result. To update a file, you change a subset of the bytes in the file, and recompute the hashlist. Depending on the modification, at least one, but perhaps all, of the hash values in the hashlist will change.

Generating SHA-256 values in Python

As an example of converting some bytes using the SHA-256 hash value (encoded using base-64) in Python:

$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.sha256(b'\x00\x01\x02').hexdigest()
'ae4b3280e56e2faf83f414a6e3dabe9d5fbe18976544c05fed121accb85b53fc'
>>>

Files and filenames

Files in SurfStore are denoted by filenames, which are represented as strings. For example “MyDog.jpg”, “WinterVacation.mp4”, and “Expenses.txt” are all example of filenames. SurfStore doesn’t have a concept of a directory or directory heirarchy–filenames are just strings. For this reason, filenames can only be compared for equality or inequality, and there are no “cd” or “mkdir” commands. Filenames are case sensitive, meaning that “Myfile.jpg” is different than “myfile.jpg”.

Note that when uploading files, you should use the string that follows the last / as filename.

If / is not present, the entire string will be the filename. For e.g -

Upload File Path Filename
/user/surfstore/Myfile.jpg Myfile.jpg
./index.html index.html
WinterVacation.mp4 WinterVacation.mp4

File versions

Each file/filename is associated with a version, which is a monotonically increasing non-negative integer. The version is incremented any time the file is created, modified, or deleted. The purpose of the version is so that clients can detect when they have an out-of-date view of the file hierarchy.

For example, imagine that Client 1 wants to update a spreadsheet file that tracks conference room reservations. Ideally, they would perform the following actions:

One client

However, another client might be concurrently modifying this file as well. In reality, the order of operations might be:

Two clients incorrect

As you can see, Client 1 overwrote the change that client 2 made without realizing it. We can solve this problem with file versions. Every time a file is modified, its version number is incremented. SurfStore only records modifications to files if the version is one larger than the currently recorded version. Let’s see what would happen in the two-client case:

Two clients correct

To delete a file, the MetadataStore service records a versioned “tombstone” update. This update simply indicates that the file has been deleted. In this way, deletion events also require version numbers, which prevents race conditions that can occur when one client deletes a file concurrently with another client deleting that file. Note that this means that a deleted file must be recreated before it can be read by a client again. If a client tries to delete a file that has already been deleted, that is fine–just handle the version numbers of these tombstone updates appropriately.

Also note that there is a difference between a deleted file and a file that has not been created so far. Trying to read both types of file will lead to a “File Not Found” error but a deleted file has a positive value for the version number, whereas a file that does not exist has a version = 0.

Processes

SurfStore consists of three types of processes: client processes, one or more BlockStore process, and a single Metadata process. These processes never fail, never crash, and never lose connectivity with each other. If you want your service to handle these cases, I highly recommend taking CSE 223B!

Client

A client is a program that interacts with SurfStore. It is used to create, modify, read, and delete files. Your client will call the various file modification/creation/deletion RPC calls. We will be testing your service with our own client, and your client with instrumented versions of our service.

The client implements the following functionalities (the config files are described in a later section):

We will be testing your client by calling these functions. Make sure that your client matches the same syntax as described above.

Some examples:

$ client /etc/myconfig.txt upload /home/aturing/myvacation.mp4
$ mkdir /home/aturing/downloads
$ client /etc/myconfig.txt download myvacation.mp4 /home/aturing/downloads
$ diff /home/aturing/myvacation.mp4 /home/aturing/downloads/myvacation.mp4
(indicates the files are identical)
$ cp /home/aturing/springbreak.mp4 /home/aturing/myvacation.mp4
$ client /etc/myconfig.txt upload /home/aturing/myvacation.mp4
$ client /etc/myconfig.txt delete myvacation.mp4
$ client /etc/myconfig.txt download myvacation.mp4 /tmp
(indicates an error because the file isn't there)

BlockStore

The BlockStore service is an in-memory data store that stores blocks of data, indexed by the hash value. Thus it is a key-value store. It supports basic get() and put() operations. It does not need to support deleting blocks of data–we just let unused blocks remain in the store. The BlockStore service only knows about blocks–it doesn’t know anything about how blocks relate to files.

The service implements the following API:

MetadataStore

The MetadataStore process maintains the mapping of filenames to hashlists. All metadata is stored in memory, and no database systems or files will be used to maintain the data. When we test your project, we will always start from a “clean slate” in which there are no files in the system.

The service implements the following API:

To create a file that has never existed, use the ModifyFile() API call with v set to 1. To create a file that was previously deleted, use ModifyFile with a version number that is one larger than the “tombstone” update.

Basic operating theory

When a client wants to create a new file, it first contacts the MetadataStore to see if the file already exists (or existed in the past, and has since been deleted). If so, it notes the previous version number.

The client then reads its local copy of the file and splits it into blocks, as described above. It then computes the hash values of each of the blocks to form a hashlist. It then contacts the MetadataStore leader and invokes the ModifyFile() API, passing it the filename, updated version number, and hashlist.

Clients are also responsible for uploading the blocks of the file to the BlockStore service. A naive implementation would do this after uploading the hashlist via ModifyFile().

But this leads to a potential race condition: Consider the scenario -

  1. Client 1 calls the metadata store to modify a file and provides a hashlist
  2. The metadata store saves the mapping for this file and hashlist after checking that the version is correct and returns the list of missing blocks to Client 1
  3. Client 1 starts uploading the missing blocks to the block store.
  4. It is possible that another client, Client 2 tries to read the same file and gets the updated hashlist.
  5. When client 2 contacts the block store, Client 1 might still be uploading the blocks or it might have crashed before it could upload the blocks. Do you see the problem here?

To prevent these issues, the protocol you’re going to use works as follows. When the client does a ModifyFile() operation, the MetadataStore is going to query the BlockStore(s) for each of the hash values in the hashlist, to see which, if any, of the blocks are already in the BlockStore. If any blocks are missing from the BlockStore, the MetadataStore will reply back to the client with a list of missing blocks. The MetadataStore will not create or modify the filename to hashlist mapping if any blocks are not present in the BlockStore. Only when all the blocks are in the BlockStore will the MetadataStore signal a success return value to the client’s ModifyFile() operation, and from then on the new file version is available to any clients that want to download it.

As an example:

Upload example

To download a file, the client invokes the ReadFile() API call on the MetadataStore, passing in the filename. The MetadataStore simply returns the version and hashlist to the client. The client then downloads the blocks from the BlockStore to form the complete file. As an example:

Download example

When a file is deleted, we create tombstone update in the metadata store, but the blocks in blockstore are not removed. This has the advantage that a client that is downloading the blocks when another client deletes the file is not affected by the delete operation.

Implementation details

Configuration file

For this project, you will use a configuration file describing the cluster details, with the following format:

    B: 1
    metadata: <host>:<port>
    block1: <host>:<port>

The above deployment has a single BlockStore server.

    B: 2
    metadata: <host>:<port>
    block1: <host>:<port>
    block2: <host>:<port>

This example has two BlockStore servers.

This config file will be available to the client and servers when they are started. This configuration file helps the server or client know the cluster information and also how many blockstore servers are present in the service. The configuration file we provide will always be valid.

Client

Client output

You can print out log and status information to stderr. To stdout, please either print:

Note that there is no type of “old version” message, because if the client tries to update the file to version v and another client updates the metadata store first, then the original client should simply re-request the most up-to-date version number and try again.

Locating blocks

If your deployment has more than one blockstore server, you may be wondering how the client and the metadata store choose locations for data blocks. In this assignment, you are going to choose a simple, static assignment of blocks to blockstore servers using the following algorithm:

def findServer(h):
	return int(h,16) % numBlockStores

<<<<<<< HEAD

Where h is a hex-encoded SHA-256 hash value.

Where h is a hex-encoded SHA-1 hash value (The value that you generated using the function hashlib.sha1(b'\x00\x01\x02').hexdigest() ).

Starting rpyc server and connecting to it

  1. Starting a server is like running any other python command - python server.py
  2. The server should be implemented as a class that inherits the rpyc.Service class and then implements the rpc methods.
  3. The class should then provide a method to start the rpc server. You can use the ThreadedServer class provided -
if __name__ == '__main__':
		from rpyc.utils.server import ThreadedServer
		server = ThreadedServer(Server(), port=5000)
		server.start()

To connect to a server from a client use the following syntax -

import rpyc
conn = rpyc.connect('localhost', 5000)

For more information take a look at the Van Steen textbook and the rpyc documentation. >>>>>>> 0e83decc8fa09b3ad3792ef0ffdd1d5154d28a9f

Grading rubric

Starter code

The starter code is available via this GitHub invitation.

Submitting your work

Log into gradescope.com and upload your solution as a zipped folder of your repository or upload a submission using the github submission method. If you use the github method, make sure you submit after you push a commit as the changes will not automatically reflect on gradescope. Don’t forget to add your paper summary to papersummary.txt.

Due date/time

Friday Nov 9, 5pm

Points

This assignment is worth 10 points

Assigned TA

Kunal Kashilkar