Lecture 7: Replication via Quorums and Best-Effort, CVVs

Introduction To Replication

Today we are going to move into the next topic: replication. It is often the case that we want to replicate data in a distributed system. We might do this to make our system more robust or accessible in light of failure, or to ensure that there is a copy of the data "nearby" in order to improve access latency.
But if we are not careful how we manage replication, we can actually end up with a system that encounters more latency, or is less likely to be available. During today's discussion, we are going to assume that we need one copy semantics, despite replication. In other words, we will assume that we want the results of read and write operations to be the same as they would be if they were acting on a single, non-replicated data store.
Note: For the first part of this discussion, to take things one step at a time, let's pretend that interleaving among concurrent operations is not a concern. But, I promise -- we'll get there very shortly.

Replication and Conflict

If we have replicated data, we have a choice about which of the copies of the data we will access to complete any operation. Perhaps we could access only one replica, or perhaps all of them, or perhaps any number in between. The decision that we make could affect consistency.
Consider a system where there are 4 replicas, R₁, R₂, R₃, and R₄. Suppose we implemented the following policy regarding read and write operations:

Reads: Either R₁ or R₂
Writes: Either R₃ or R₄

The above policy has a problem. If writes occur, the reads will read stale data. Another policy might be the following:

Reads: All of the replicas
Writes: Any one of the replicas

As long as version numbers are used, the above policy prevents stale reads. Since the read will see all of the replicas, it will find the most recent version and report it to the application. But this policy isn't really a good one -- the most recent data isn't replicated, and reads, which are usually more common than writes, require access to every replica, while writes require access to only one.
We can solve this problem by flipping our logic. Instead of the "write-one/read-all" policy above, we can use a "read-one/write-all" policy:

Reads: Any one of the replicas
Writes: All of the replicas

The "Write-all/read-one" policy is very frequently used, because it has many good characteristics:

Read will always get the most current data
The common case, read, is fast -- it requires access to only one replica
The most recent data is fully replicated, providing fault-tolerance.

In looking at these examples, we can see that the number of replicas that are required for a read operation to remain consistent depends on the number of replicas required for a write operation. We must be guaranteed that the set of replicas selected by a read operation will intersect the set of replicas selected by any write operation. If we have 5 servers and it takes 3 to write, it must take 3 to read. If it takes 5 to write, it will take only 1 to read. If it takes 1 to write, reads must access all replicas. If this isn't the case, there is the potential for a read-write conflict -- a read might get stale data, because it doesn't see a recent write.
The general rule for avoiding a read-write conflict is this. If there are n replicas, and writes occur to some w replicas, and reads occur from some r replicas, then w > (n - r), which can also be expressed as w + r > n. The idea here is that we need to ensure that reads and writes overlap.
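To convince ourselves of the rule, here is a minimal sketch that brute-forces every possible read set against every possible write set and checks that they share at least one replica. The function name quorums_always_overlap is made up for this example; it is an illustration, not part of any real system.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Return True if every r-replica read set shares at least one
    replica with every w-replica write set, out of n replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)      # an empty intersection is falsy
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

# The brute-force answer should agree with the rule w + r > n.
for n, r, w in [(5, 3, 3), (5, 1, 5), (5, 5, 1), (5, 2, 3), (4, 1, 1)]:
    assert quorums_always_overlap(n, r, w) == (w + r > n)
```

Note that (5, 2, 3) fails: a read of 2 replicas can land entirely on the 2 replicas that a 3-replica write skipped.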
Read-read conflicts aren't a problem -- one read doesn't affect another read, even if they occur from disjoint sets of replicas. A read accesses data; it doesn't change it, so it can't affect future operations.
Write-write conflicts can occur any time the write quorum is less than "all", because a read may discover multiple versions. It isn't enough that a read discovers one instance of the newest version -- it must also know which of the versions it discovers is newest. There are a few common solutions to this problem:

Keep logical timestamps, in the form of version numbers, associated with each replica. Set these version numbers by incrementing the newest version number read from a read quorum -- either when the object is read before being updated, or immediately before the update is written. Using the first approach gives us a way of detecting and rejecting stale writes, while the second forces the most recent write, even if it is based on a stale version. In any case, for all of the reasons we discussed earlier, version numbers, as a form of logical timestamp, are way better than physical timestamps, which are very, very costly to synchronize sufficiently well.
Ensure that the write quorum is large enough that it is guaranteed to cover a majority of any read quorum. Reads can then accept the majority version, e.g. w > (n - r/2).
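The version-number approach can be sketched as follows. This is a toy, single-threaded model that assumes no concurrent operations and no failures; Replica, quorum_read, and quorum_write are hypothetical names invented for the illustration, and quorums are chosen at random just to show that any choice works.

```python
import random

class Replica:
    """One copy of the object, tagged with a version number."""
    def __init__(self):
        self.version = 0
        self.value = None

def quorum_read(replicas, r, rng=random):
    """Contact any r replicas and report the newest version among them."""
    chosen = rng.sample(replicas, r)
    newest = max(chosen, key=lambda rep: rep.version)
    return newest.version, newest.value

def quorum_write(replicas, r, w, value, rng=random):
    """Learn the newest version from a read quorum, then write
    version + 1 to any w replicas. Returns the new version number."""
    version, _ = quorum_read(replicas, r, rng)
    for rep in rng.sample(replicas, w):
        rep.version = version + 1
        rep.value = value
    return version + 1

# n = 5, r = 3, w = 3, so r + w > n and every read sees the latest write.
replicas = [Replica() for _ in range(5)]
quorum_write(replicas, 3, 3, "first")
quorum_write(replicas, 3, 3, "second")
assert quorum_read(replicas, 3) == (2, "second")
```

Some replicas may still hold version 0 or 1 afterwards. The read still gets the right answer because it overlaps the last write and the version number tells it which copy is newest.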

Quick Note About Locking/Mutual Exclusion

I want to emphasize that all of our prior discussions about mutual exclusion are applicable here. Managing replicas is fraught with concurrency-related peril. What happens if two writes occur concurrently, i.e. overlapping across time across the network? What about if a version number changes during an update? A change to an object while a read quorum is being established and read?
These, among others, are critical sections, just like any others. Insert our entire conversation of last week here. But, with a little more nuance. In particular, where do locks come from?
Well. We have a read quorum. We have a write quorum. We need a lock quorum. The lock quorum needs to be big enough to ensure that it, just like a read, covers the newest version. So, it needs to be at least as big as a read quorum. And, it needs to be big enough to ensure that it covers each replica that will be written. Thus, it needs to be as large as the write quorum. The upshot is that the lock quorum needs to be at least as large as each, i.e. L = MAX(r,w). And, each of the read quorum and the write quorum needs to be drawn from it.
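As a small sketch of the sizing rule only (it does not implement the locking itself), the following picks a lock quorum of size MAX(r,w) and draws both the read quorum and the write quorum from it. The name plan_locked_update, and the choice of simply taking the first L replicas, are assumptions made for the example.

```python
def plan_locked_update(replica_ids, r, w):
    """Pick a lock quorum of size max(r, w) and draw the read and
    write quorums from inside it."""
    lock_size = max(r, w)
    if lock_size > len(replica_ids):
        raise ValueError("not enough replicas for the lock quorum")
    locked = list(replica_ids)[:lock_size]   # arbitrary choice: the first L replicas
    return {"lock": locked, "read": locked[:r], "write": locked[:w]}

plan = plan_locked_update(["R1", "R2", "R3", "R4", "R5"], r=2, w=4)
assert plan["lock"] == ["R1", "R2", "R3", "R4"]
assert set(plan["read"]) <= set(plan["lock"])
assert set(plan["write"]) <= set(plan["lock"])
```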

Processor Failure, Partitioning, and Replica Control

Now let's consider the impact of our policy decisions on availability in the light of failure. Let's assume that a single processor fails in a system using a write-all/read-one policy. In this system, reads will be unaffected, but writes will not be possible. The not-so-useful write-one/read-all policy would allow writes, but not reads.
Is it possible for us to define a policy such that both reads and writes can continue after a failure? Perhaps. Consider a system that has 5 replicas and requires 3 replicas to write and 3 replicas to read. This system can continue to read and write, even if 2 processors fail. If both reads and writes require a majority of the processors, they can continue, despite a failure of the minority of the processors. But the price that we pay is extra communication in the common case of a functioning system -- the quorums are larger.
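The trade-off can be made concrete with a few lines of code. This sketch assumes the simplest failure model, in which a failed replica just stops answering, so an operation is possible exactly when enough replicas are still alive; available_ops is a made-up name.

```python
def available_ops(n, r, w, failed):
    """Which operations can still assemble a quorum after `failed`
    of the n replicas have stopped responding?"""
    alive = n - failed
    return {"read": alive >= r, "write": alive >= w}

# write-all/read-one: a single failure blocks writes but not reads
assert available_ops(5, r=1, w=5, failed=1) == {"read": True, "write": False}
# write-one/read-all: a single failure blocks reads but not writes
assert available_ops(5, r=5, w=1, failed=1) == {"read": False, "write": True}
# majority quorums: both survive the failure of 2 of 5 replicas
assert available_ops(5, r=3, w=3, failed=2) == {"read": True, "write": True}
```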
Tanenbaum and Van Renesse suggest another approach called voting with ghosts. This approach allows the counting of dead processors toward a write quorum. Basically a ghost processor is set up that votes on behalf of the dead processor. When it gets a new object via a write, it just throws it away (it is a ghost, after all).
But this approach is somewhat problematic -- how does one know if a processor has failed, or if communication has failed? Sometimes, we may be able to tell the difference, but for the most part, it is impossible (remember the discussion of failure last class). Now consider a partitioning of the network. Processors in each partition will assume that the processors in the other partitions are dead and ghosts will cast their votes. Now both partitions are receiving updates. Once the network partitioning is repaired, there will be a write-write conflict.
For this reason, voting with ghosts isn't very practical and other than the original publication, as far as I know, it is only discussed in Tanenbaum's own textbook. But I like to talk about it, because I'll use it as a bridge to discuss something later on -- so don't completely force it out of your mind.

Static Quorums

The decision about how many replicas should be involved in operations is known as quorum selection. What we have discussed so far implies a set of rules for selecting read and write quorums:

There is a read quorum, r, such that at least r replicas must be accessed by a read operation.
There is a write quorum, w, such that at least w replicas must be accessed by a write operation.
Given n replicas, r + w > n
w > (n - r/2), or version numbers must be used.

A less formal statement of these rules follows:

A read quorum is required for a read to succeed
A write quorum is required for a write to succeed
Read and write quorums must always intersect
Version numbers (based upon a read quorum) must be used, or writes must cover a majority of the read quorum

Voting with Static Quorums

A version of the static quorum technique called Voting with Static Quorums provides a mechanism for assigning an importance to various replicas.
It works exactly like the simple Static Quorum approach above, except that not all replicas count equally. Each replica is assigned a particular number of votes. Now, instead of defining a quorum in terms of a number of replicas, it is defined in terms of a number of votes. But the same rules as above still apply: we still need version numbers or synchronized timestamps, read and write quorums must still intersect, and writes require a majority of the votes.
This approach gives us a way of dealing with cached copies as replicas -- we can assign them 0 votes. Perhaps 0-vote replicas will require a version check from a read-quorum, but not a full data transfer.
It also gives us a way of preventing a bunch of unreliable servers from blocking a quorum. Now they can be given a low number of votes, but perhaps they'll have a useful replica in the event of a failure.
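Here is a minimal sketch of weighted voting, using the two rules stated above: read and write thresholds must intersect, and writes need a majority of the votes. The replica names, the vote assignments, and the function names are all invented for the example.

```python
def valid_vote_config(votes, read_votes, write_votes):
    """Check the two rules: read and write quorums must intersect, and
    a write must gather a majority of all votes."""
    total = sum(votes.values())
    return read_votes + write_votes > total and 2 * write_votes > total

def has_quorum(votes, responders, needed):
    """Do the replicas that answered carry enough votes between them?"""
    return sum(votes[name] for name in responders) >= needed

# Hypothetical assignment: one reliable server, two weaker ones, and a
# cached copy that holds data but carries no votes.
votes = {"A": 3, "B": 2, "C": 1, "cache": 0}
assert valid_vote_config(votes, read_votes=3, write_votes=4)

assert has_quorum(votes, ["A"], needed=3)            # A alone can serve a read
assert not has_quorum(votes, ["cache"], needed=3)    # a cache alone cannot
assert has_quorum(votes, ["A", "C"], needed=4)       # A plus C can write
assert not has_quorum(votes, ["B", "C"], needed=4)   # the weak servers cannot
```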

Coda and Replication

Coda is a filesystem derived from version 2 of the AFS file system that we use on campus. It implements replication for writeable volumes. Let's take a quick look at its approach, which is based on a form of vector logical time, known as a Coda Version Vector (CVV), and which is interesting because it uses an optimistic approach and allows clients to resolve conflicts latently.
Each CVV contains one entry for each host server. Each entry is the version number of the file on the corresponding server. In the perfect case, the entry for each replica will be identical. But, should an update reach only a portion of the servers, some servers will have newer versions than others.
In Coda, the client requests a file via a three-step process.

It asks all replicas for their version number
It then asks the replica with the greatest version number for the file
If the servers don't agree about the file's version, the client can direct the servers to update a replica that is behind, or inform them of a conflict. CVVs are compared just like vector timestamps. A conflict exists if two CVVs are concurrent, because concurrent vectors indicate that each server involved has seen some changes to the file, but not all changes.
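The comparison in the last step is the standard vector timestamp comparison. A minimal sketch, treating a CVV as a tuple of per-server version numbers (compare_cvv and its return labels are made up for this example, not Coda's actual interface):

```python
def compare_cvv(a, b):
    """Compare two version vectors of equal length, entry by entry.
    Returns "equal", "newer" (a dominates b), "older" (b dominates a),
    or "concurrent" (each has seen an update the other has not)."""
    a_ahead = any(x > y for x, y in zip(a, b))
    b_ahead = any(y > x for x, y in zip(a, b))
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "newer"
    if b_ahead:
        return "older"
    return "equal"

assert compare_cvv((2, 2, 1, 1), (1, 1, 1, 1)) == "newer"       # other replica is just behind
assert compare_cvv((2, 2, 1, 1), (1, 1, 2, 2)) == "concurrent"  # a real conflict
```

Only the "concurrent" case is a conflict. In the "newer" and "older" cases one replica is simply behind and can be brought up to date.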

In the perfect case, when the client writes a file, it does it in a multi-step process:

The client sends the file to all servers, along with the original CVV.
Each server increments its entry in the file's CVV and ACKs the client.
The client merges the entries from all of the servers and sends the new CVV back to each server.
If a conflict is detected, the client can inform the servers, so that it can be resolved automatically, or flagged for mitigation by the user.
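The CVV bookkeeping in the write steps can be sketched as below. This models only the version vectors, not the file transfer or the conflict check, and it assumes the merge is an element-wise maximum, as with vector timestamps. The function coda_write and the reachable parameter are hypothetical names for the illustration.

```python
def coda_write(server_cvvs, reachable):
    """server_cvvs maps a server index to that server's CVV (a list with
    one entry per server). `reachable` lists the servers taking part."""
    # Each participating server increments its own entry and ACKs.
    for s in reachable:
        server_cvvs[s][s] += 1
    # The client merges the ACKed vectors (element-wise maximum) ...
    merged = [max(column) for column in zip(*(server_cvvs[s] for s in reachable))]
    # ... and sends the merged CVV back to every participating server.
    for s in reachable:
        server_cvvs[s] = list(merged)
    return merged

cvvs = {i: [1, 1, 1, 1] for i in range(4)}
assert coda_write(cvvs, reachable=[0, 1, 2, 3]) == [2, 2, 2, 2]
```

If only some of the servers take part, only their entries advance, which is how a server that missed the write is later recognized as being behind.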

Given this process, let's consider what happens if one or more servers should fail. In this case, the client cannot contact the server, so it temporarily forgets about it. The collection of volume servers that the client can communicate with is known as the Available Volume Storage Group (AVSG). The AVSG is a subset of the full Volume Storage Group (VSG).
In the event that the AVSG is smaller than the VSG, the client does nothing special. It goes through the same process as before, but only involves those servers in the AVSG.
Eventually, when the partitioned or failed server becomes accessible, it will be added back to the AVSG. At this point, it will be involved in reads and writes. When this happens, the client will begin to notice any writes the server has missed, because that server's CVV will be behind the others in the group. This will be automatically fixed by the next write operation.
Coda clients also periodically poll the members of their VSG. If they find that hosts have appeared that are not currently in their AVSG, they add them. When they add a server in the VSG back to the AVSG, they must compare the volume version vectors (VVVs). If the new server's VVV does not match the client's copy of the VVV, there is a conflict. To force a resolution of this conflict, the client drops all callbacks in the volume. This is because the server had updates while it was disconnected, but the client (because it couldn't talk to the server) missed the callbacks.
Now, let's assume that the network is partitioned. Let's say that half of the network is accessible to one client and the other half to the other client. If these clients play with different files, everything works as it did above. But if they play with the same files, a write-write conflict will occur. The servers in each partition will update their own version numbers, but not the other. For example, we could see the following:
  Initial:     <1,1,1,1>
               <1,1,1,1>
               <1,1,1,1>
               <1,1,1,1>

  --------- Partition 1/2 and 3/4 ----------
  Write 1/2:   <2,2,1,1>
               <2,2,1,1>

  Write 3/4:   <1,1,2,2>
               <1,1,2,2>


  --------- Partition repaired ----------
  Read (ouch!) <1,1,2,2>
               <1,1,2,2>
               <2,2,1,1>
               <2,2,1,1>
  
The next time a client does a read (or a write), it will detect the inconsistency. This inconsistency cannot be resolved automatically and must be repaired by the user. Coda simply flags it and requests the user's involvement before permitting a subsequent access.
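The check a client performs at read time can be sketched as follows: look for one CVV that is at least as new as every other in every entry. If there is none, the replicas have diverged. The function names are invented for the example, and real Coda does considerably more than this.

```python
def dominates(a, b):
    """True if vector a is at least as new as b in every entry."""
    return all(x >= y for x, y in zip(a, b))

def newest_or_conflict(cvvs):
    """Return a CVV that dominates all the others, or None if no such
    vector exists (some pair is concurrent, i.e. a conflict)."""
    for candidate in cvvs:
        if all(dominates(candidate, other) for other in cvvs):
            return candidate
    return None

# The state after the partition above is repaired: nobody dominates.
after_repair = [(1, 1, 2, 2), (1, 1, 2, 2), (2, 2, 1, 1), (2, 2, 1, 1)]
assert newest_or_conflict(after_repair) is None

# A server that merely missed a write is just behind, not in conflict.
one_behind = [(2, 2, 2, 1), (2, 2, 2, 1), (2, 2, 2, 1), (1, 1, 1, 1)]
assert newest_or_conflict(one_behind) == (2, 2, 2, 1)
```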