PA2: SurfStore

2019/02/07

FAQ

Pre-submission checklist

Feb 13, 2019

Where on the server are files kept?

Feb 12, 2019

Does the next version of a file have to be ==(current version + 1) or just >=(current version + 1)?

If there is a version change for a file after the client gets an updated FileInfoMap, and the client determines it has to download the file, should it download the version from the FileInfoMap or the current version on the server?

Can we assume that the files on the client wont change for the duration of sync?

What should we do for directories in base_dir?

Since we are given that “the last block in the file … may be smaller (but must be at least 1 byte large)” does this mean we wont have to deal with empty files?

Assuming only one client (and no other changes occur outside from the programs), by the end of sync() the FileInfoMap stored on the client and server should be the same, right?

Can we change the starter code?

How to handle turning a C++ string into a hash value?

Can we assume there will always be a valid “index.txt” file? And its content can only be modified by SurfStoreClient?

Are we required to implement these functions for grading? Can we change the signatures, or just write our own RPC functions?

SurfStore

In this project you are going to create a cloud-based file storage service called SurfStore. SurfStore is a networked file storage application that is based on DropBox, and lets you sync file to and from the “cloud”. You will implement the cloud service, and a client which interacts with your service via RPC.

Multiple clients can concurrently connect to the SurfStore service to access a common, shared set of files. Clients accessing SurfStore “see” a consistent set of updates to files, but SurfStore does not offer any guarantees about operations across files, meaning that it does not support multi-file transactions (such as atomic move).

The SurfStore service is composed of the following two sub-services:

Specification

We now describe the service in more detail.

Basic concepts

Blocks, hashes, and hashlists

A file in SurfStore is broken into an ordered sequence of one or more blocks. Each block is of uniform size (defined in the configuration file), except for the last block in the file, which may be smaller (but must be at least 1 byte large). As an example, assume the block size is 4096 bytes, and consider the following file:

Hash example

The file ‘MyFile.mp4’ is 14,437 bytes long, and the block size is 4KB. The file is broken into blocks b0, b1, b2, and b3 (which is only 2,149 bytes long). For each block, a hash value is generated using the SHA-256 hash function. So for MyFile.mp4, those hashes will be denoted as [h0, h1, h2, h3] in the same order as the blocks. This set of hash values, in order, represents the file, and is referred to as the hashlist. Note that if you are given a block, you can compute its hash by applying the SHA-256 hash function to the block. This also means that if you change data in a block the hash value will change as a result. To update a file, you change a subset of the bytes in the file, and recompute the hashlist. Depending on the modification, at least one, but perhaps all, of the hash values in the hashlist will change.

We will be using okdshin’s PicoSHA2 hash library for this project, which is included in the starter code.

Files and filenames

Files in SurfStore are denoted by filenames, which are represented as strings. For example “MyDog.jpg”, “WinterVacation.mp4”, and “Expenses.txt” are all example of filenames. SurfStore doesn’t have a concept of a directory or directory heirarchy–filenames are just strings. For this reason, filenames can only be compared for equality or inequality, and there are no “cd” or “mkdir” commands. Filenames are case sensitive, meaning that “Myfile.jpg” is different than “myfile.jpg”. Filenames cannot contain the “/” character.

The base directory

The configuration file specifies a “base directory” for the client. This is the directory that is going to be synchronized with your cloud-based service. Your client will upload files from this base directory to the cloud, and download files (and changes to files) from the cloud into this base directory. Your client should not modify any files outside of this base directory. Note in particular–your client should now download files into the “current” directory, only the directory specified in the configuration file.

File versions

Each file/filename is associated with a version, which is a monotonically increasing non-negative integer. The version is incremented any time the file is created, modified, or deleted. The purpose of the version is so that clients can detect when they have an out-of-date view of the file hierarchy.

For example, imagine that Tahani wants to update a spreadsheet file that tracks conference room reservations. Ideally, they would perform the following actions:

One client

However, another client might be concurrently modifying this file as well. In reality, the order of operations might be:

Two clients incorrect

As you can see, Tahani overwrote the change that Jianyu made without realizing it. We can solve this problem with file versions. Every time a file is modified, its version number is incremented. SurfStore only records modifications to files if the version is one larger than the currently recorded version. Let’s see what would happen in the two-client case:

Two clients correct

In the above example, both Tahani and Jianyu downloaded identical copies of confroom.txt (at version 5). They then both started editing their local copies of the file. So there was a point where Tahani had “her own” version 6 of the file (with her local changes), and Jianyu had “his own” version 6 of the file (with his local changes). How do we know whose “version 6” is the real version 6?

The answer is that whoever syncs their changes to the cloud first wins. So in this example, Jianyu was first to sync his changes to the cloud, which caused his modifications to the file to become the official version 6 of the file. Later, when Tahani tries to upload her changes, she realizes that Jianyu beat her to it, and so Jianyu’s changes to the file will overwrite her copy.

Deleting files

To delete a file, the MetadataStore service records a versioned “tombstone” update. This update simply indicates that the file has been deleted. In this way, deletion events also require version numbers, which prevents race conditions that can occur when one client deletes a file concurrently with another client deleting that file. Note that this means that a deleted file must be recreated before it can be read by a client again. If a client tries to delete a file that has already been deleted, that is fine–just handle the version numbers of these tombstone updates appropriately.

To represent a “tombstone” record, we will set the file’s hash list to a single hash value of “0” (zero).

Client

Your client will “sync” a local base_dir base directory with your SurfStore cloud service. When you invoke your client, the sync operation will occur, and then the client will exit. As a result of syncing, new files added to your base directory will be uploaded to the cloud, files that were sync’d to the cloud from other clients will be downloaded to your base directory, and any files which have “edit conflicts” will be resolved. Resolving conflicts will be described below.

A simple example, assuming that Tahani keeps her files in /tdata, and Jianyu keeps his files in /jdata:

tahani $ ls /tdata
tahani $
(Tahanis base directory starts empty)

tahani $ cp ~/cat.jpg /tdata
tahani $ ./ss myconfig.ini
(syncs cat.jpg to the cloud)

jianyu $ ls /jdata
jianyu $
(Jianyus base directory starts empty)

jianyu $ ./ss myconfig.ini
(cat.jpg gets syncd from the cloud
jianyu $ ls /jdata
cat.jpg

The client maintains an index.json file in the base directory which holds local, client-specific information that must be kept between invocations of the client. In particular, the index.json contains a copy (in JSON format) of the server’s FileInfoMap accurate as of the last time that sync was called. The purpose of this index file is to detect files that have changed, or been added, to the base directory since the last time that the client executed.

Storing blocks

The SurfStore server keeps a map of data blocks in memory, stored in a key-value store.. It supports basic get() and put() operations. It does not need to support deleting blocks of data–we just let unused blocks remain in the store. The BlockStore service only knows about blocks–it doesn’t know anything about how blocks relate to files.

The service implements the following API:

Metadata

Each SurfStore server maintains the mapping of filenames to hashlists (and version numbers) called a FileInfoMap. All metadata is stored in memory, and no database systems or files will be used to maintain the data. When we test your project, we will always start from a “clean slate” in which there are no files in the system.

The service implements the following API:

To create a file that has never existed, use the update\_file() API call with a version number set to 1. To create a file that was previously deleted, update the version number that is one larger than the “tombstone” record.

Basic operating theory

When a client syncs its local base directory with the cloud, a number of things must be done to properly complete the sync operation.

The client should first scan the base directory, and for each file, compute that file’s hash list. The client should then consult the local index file and compare the results, to see whether (1) there are now new files in the base directory that aren’t in the index file, or (2) files that are in the index file, but have changed since the last time the client was executed (i.e., the hash list is different).

Next, the client should connect to the server and download an updated FileInfoMap. For the purposes of this discussion, let’s call this the “remote index.”

The client should now compare the local index (and any changes to local files not reflected in the local index) with the remote index. A few things might result.

First, it is possible that the remote index refers to a file not present in the local index or in the base directory. In this case, the client should download the blocks associated with that file, reconstitute that file in the base directory, and then add the updated FileInfo information to the local index.

Next, it is possible that there are new files in the local base directory that aren’t in the local index or in the remote index. The client should upload the blocks corresponding to this file to the server, then update the server with the new FileInfo. If that update is successful, then the client should update its local index. Note it is possible that while this operation is in progress, some other client makes it to the server first, and creates the file first. In that case, the update\_file() operation will fail with a version error, and the client should handle this conflict as described in the next section.

Handling conflicts

The above discussion assumes that a file existed in the server that wasn’t locally present, or a new file was created locally that wasn’t on the server. Both of these cases are pretty straightford (simply upload or download the file as appropriate). But what happens when there is some kind of version mismatch, as described in the motivation at the top of this specification? We describe what to do in this subsection.

Imagine that for a file like cat.jpg, the local index shows that file at version 3, and we compare the hash list in the local index with the file contents, and confirm that there are no local modifications to the file. We then look at the remote index, and see that the version on the server is larger, for example 4. In this case, the client should download any needed blocks from the server to bring cat.jpg up to version 4, then reconstitute cat.jpg to become version 4 of that file, and finally the client should update its local index, brining that entry to version 4. At this point, the changes from the cloud have been merged into the local file.

Consider the opposite case: the client sees that its local index references cat.jpg with version 3. The client compares the hash list in the local index to the file contents, and sees that there are uncommitted local changes (the hashes differ). The client compares the local index to the remote index, and sees that both indexes are at the same version (in this case, version 3). This means that we need to sync our local changes to the cloud. The client can now update the mapping on the server, and if that RPC call completes successfully, the client can update the entry in the local index and is done (there is no need to modify the file’s contents in the base directory in this case).

Finally, we must consider the case where there are local modifications to a file (so, for example, our local index shows the file at version 3, but the file’s contents do not match the local index). Further, we see that the version in the remote index is larger than our local index. What should we do in this case? Well, we follow the rule that whoever syncs to the cloud first wins. Thus, we must go with the version of the file alrady syncd to the server. So we download any required blocks and bring our local version of the file up to date with the cloud version.

Implementation details

Configuration file

For this project, you will use a configuration file that specifies the host and port of the server, and the location of the base directory that the client will use. Further, the block size is specified in the configuration file as well. You should use the block size provided in the config file.

Libraries

This project relies on five libraries to function–the librpc library to provide RPC functionality (discussed in class), a logging library, a configuration file handling library, a SHA-256 library used to compute hashes of binary blocks, and a JSON library that is used to maintain the index.json file. Note that our starter code “wraps” the JSON code so you don’t really need to interact with the JSON library directly.

This project includes quite a bit of starter code so that you can focus on the logic of the storage protocol. Project 2 doesn’t really require you to write a lot of code, but the code that you do write is quite intricate, with several potential cases you need to reason about. Start early!

Grading rubric

Starter code

The starter code is available via this GitHub invitation.

Submitting your work

Log into gradescope.com and upload your solution as a zipped folder of your repository or upload a submission using the github submission method. If you use the github method, make sure you submit after you push a commit as the changes will not automatically reflect on gradescope.

Due date/time

Friday Febrary 22, 5pm

Points

This assignment is worth 20 points