Distributed Concurrency Control
Last class we discussed building concurrency control primitives from atomic shared memory. It was a nice review -- and a reminder of why the familiar primitives, implemented in familiar ways, don't work in distributed systems. Today, we look at mechanisms that we can use in distributed systems.
Keep in mind that the mechanisms discussed today can be incorporated into familiar APIs, for example, "mutex_lock(lock)" and "mutex_unlock(lock)". The trick is what we are doing under the hood -- and the time and effort that is required, and the robustness and availability provided.
Base Case: Centralized Approach
Although centralized approaches have their standard collection of shortcomings, including scalability, fault tolerance, and accessibility, they provide a useful starting point for discussion. So we'll begin by discussing a centralized approach to ensuring mutual exclusion for a critical section:
- A Central Coordinator is needed.
- This coordinator can be appointed or elected
- The coordinator is responsible for granting requests to enter the critical section
- It ensures that only one thread is in the critical section at a time.
- If the critical section is in use, the request is enqueued; otherwise, the request is immediately granted
- Enqueued requests are granted when the critical section becomes available
- Want into the CS?
- Ask the coordinator & wait for permission.
- Done with the critical section?
- Tell the coordinator
- The coordinator will then let the next thread in, if any.
- Good Traits
- It does guarantee mutual exclusion
- It only requires 3 messages per critical section entry (request, permission, done)
- Bad Traits
- The coordinator dies...then what?
- A thread dies in the critical section...then what?
- In either of the above cases, it is hard to tell what has happened
- Standard centralized problems
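The coordinator's bookkeeping can be captured in a short sketch. This is a single-process simulation only -- the `Coordinator` class and direct method calls (standing in for the request/permission/done messages) are my own names, not part of any particular system:

```python
from collections import deque

class Coordinator:
    """Central coordinator: at most one client holds the critical section."""
    def __init__(self):
        self.holder = None       # client currently in the CS, if any
        self.waiting = deque()   # enqueued requests

    def request(self, client):
        # REQUEST message: grant immediately if the CS is free, else enqueue.
        if self.holder is None:
            self.holder = client
            return True          # "permission" sent back right away
        self.waiting.append(client)
        return False             # caller must wait for a later grant

    def release(self, client):
        # DONE message: let the next enqueued client in, if any.
        assert client == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder       # the newly granted client (or None)
```

Note that each entry costs exactly the three messages named above, regardless of how many clients exist -- the coordinator is the only party everyone talks to.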
Timestamp Approach (Lamport)
Another approach to mutual exclusion involves sending messages to all nodes and ordering requests using Lamport logical time. The first such approach was described by Lamport. It requires that ties be broken using host id (or a similar value) to ensure that there is a total ordering among events.
This approach is based on the notion of a global priority queue of requests for the critical section. This queue is ordered by the logical time of the request. Unlike the central algorithm we discussed as the "base case", this approach calls for each node to maintain a copy of this queue. The copies are maintained in a consistent way using a request-reply protocol.
When a node wants access to the critical section, it sends a REQUEST message to every other node. This message can be sent via a multicast or a collection of unicasts. This message contains the logical time of the request. When a participant receives this request, it adds it to its priority queue and sends a REPLY message to the requesting node.
The requesting node takes no action until it receives all of the replies. This ensures that the request has been entered into all of the queues, and that, at least with respect to this request, the queues are consistent. Once it receives all of the replies, the request is free to go -- once its turn arrives.
If the critical section is available (the queue was previously empty), the request can go as soon as the last REPLY is received. If the critical section is in use, the request must wait.
When a node exits the critical section, it removes itself from its own queue and sends a RELEASE message to every other participant, perhaps by multicast. This message directs these nodes to remove the now-completed request for the critical section from their queues. It also directs them to "peek" at their queue.
If the first request in a host's queue is its own, it enters the critical section. Otherwise, it does nothing. A host can enter the critical section if it is at the head of its own queue, because the REPLY ensures that it will be at the head of every other node's queue.
The RELEASE message does not need an ACK or a REPLY, because it does not matter if its arrival is delayed. Since we are assuming a reliable unicast or multicast, the RELEASE will eventually reach each participant. We don't care if it arrives late -- this doesn't break the correctness of the algorithm. In the worst case, it is delayed in its arrival to the next requestor to enter the critical section. In this case, the critical section will go unused until the RELEASE arrives and is processed by the host. In the other cases, it delays the host in "peeking" at the queue, but this is without consequence -- the delayed host wasn't going to enter the critical section, anyway.
But wait! Why do we need the REPLY to the REQUEST, then? Can't we just get rid of it? Well, not exactly. The problem is that a reliable protocol guarantees that a message will eventually arrive at its destination, but makes no guarantees about when. The protocol may retransmit the information many, many times, over many, many timeout periods, before successfully delivering the message.
In the case of the RELEASE message, timing is not critical. But this is not the case for the REQUEST message. The REQUEST message must be received by all nodes before the requesting node can enter the critical section. This is the only way of ensuring that all nodes will see the same head node, should a RELEASE message arrive. Otherwise, two different hosts could look at their queues, determine that they are at the head, and enter the critical section -- disaster. This disaster could be detected after-the-fact when the belated REQUEST arrives -- but this is too late, since mutual exclusion has already been violated.
This approach requires 3(N - 1) messages per request: REQUEST, REPLY, and RELEASE must be sent to every other node. It isn't very fault-tolerant. Even a single failed host can disable the system -- it can't REPLY.
Timestamp Approach (Ricart and Agrawala)
The Lamport approach described above was improved by Ricart and Agrawala, who observed that the REPLY and RELEASE messages could be combined. This is achieved by having the process that is currently within the critical section delay its REPLY until it exits the critical section. In order to do this, each process must queue REQUESTs while within the critical section.
In many respects, this change converts this approach from a "global queue" approach to a "voting" approach. A node requests entry to the critical section and enters it as soon as it has received an OK (REPLY) vote from every other node.
The details of this approach follow:
- Build a message
- Send message to all participants
- If not in CS and don't want in, reply OK
- If in CS, enqueue request
- If not in CS, but want into the CS, and the requestor's time is lower, reply OK (messages crossed, requestor was first)
- If not in CS, but want into the CS, and the requestor's time is greater, enqueue request (messages crossed, participant was first)
- On exit from CS, reply OK to everyone on queue (and dequeue each)
- Once received OK from everyone, enter CS
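The rules above can be sketched as a participant class. Again, this is a synchronous simulation in which message delivery is a direct method call, and the names are of my own choosing:

```python
class RANode:
    """Ricart & Agrawala participant (sketch; delivery is simulated)."""
    RELEASED, WANTED, HELD = 'RELEASED', 'WANTED', 'HELD'

    def __init__(self, node_id):
        self.id = node_id
        self.state = RANode.RELEASED
        self.stamp = None       # our (time, id) request while WANTED or HELD
        self.deferred = []      # requestors to OK when we exit the CS
        self.oks = []           # OK votes received so far

    def on_request(self, stamp, sender):
        # Defer if we are in the CS, or if we want in and asked first
        # (messages crossed: compare (time, id) stamps).
        competing = self.state == RANode.HELD or (
            self.state == RANode.WANTED and self.stamp < stamp)
        if competing:
            self.deferred.append(sender)   # enqueue the request
        else:
            sender.on_ok(self.id)          # reply OK immediately

    def on_ok(self, voter):
        self.oks.append(voter)             # enter CS once all have arrived

    def exit_cs(self):
        self.state = RANode.RELEASED
        for node in self.deferred:         # the combined REPLY+RELEASE
            node.on_ok(self.id)
        self.deferred = []
```

The deferred OK plays the role of both the REPLY and the RELEASE, which is exactly where the (N - 1)-message saving comes from.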
This approach requires 2*(N - 1) messages, that is, one message to and from everyone except self. This is an (N - 1)-message improvement over Lamport's approach.
But it fails to address the more serious limitation -- fault tolerance. Even a single failure can disable the entire system. Both timestamp approaches require more messages than a centralized approach -- and have lower fault tolerance. The centralized approach has one single point of failure (SPF). These timestamp approaches have N SPFs.
In truth, it is doubtful that we would ever want to use either approach. In practice, centralized coordinators and ring approaches are the workhorses. Centralized coordinators can be made more fault tolerant using coordinator election (coming soon).
But these timestamp approaches are the most distributed -- they involve every host in every decision. They also illustrate some important examples of global state, logical time, &c -- and so they are a valuable part of this (and any) distributed systems course.
Mutual Exclusion: Voting
Last class we discussed the Ricart and Agrawala approach to ensuring mutual exclusion. It was much like asking hosts to vote about who can enter the critical section and allowing access only upon unanimous consent. But is unanimous consent necessary? Can't we get away with a simple majority, since two hosts can't concurrently win a majority of the votes?
In a simple form, it might operate similarly to a democratic election.

When entry into the critical section is desired:
- Ask permission from all other participants via a multicast, broadcast, or collection of individual messages
- Wait until more than 50% respond "OK"
- Enter the critical section

When a request from another participant to enter the critical section is received:
- If you haven't already voted, vote "OK."
- Otherwise enqueue the request.

When a participant exits the critical section:
- It sends RELEASE to those participants that voted for it.

When a participant receives RELEASE from the elected host:
- It dequeues the next request (if any) and votes for it with an "OK."
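The voter's side of this protocol can be sketched as follows. This is a single-process simulation with names I have made up; the candidate enters the CS once it has collected more than half of the "OK" votes:

```python
from collections import deque

class Voter:
    """Majority-voting participant (sketch): one outstanding vote at a time."""
    def __init__(self):
        self.voted_for = None    # candidate holding our vote, if any
        self.pending = deque()   # requests received while our vote was out

    def on_request(self, candidate):
        if self.voted_for is None:
            self.voted_for = candidate
            return 'OK'          # vote immediately
        self.pending.append(candidate)
        return None              # vote withheld; request enqueued

    def on_release(self):
        # Our candidate is done: pass the vote to the next in line, if any.
        self.voted_for = None
        if self.pending:
            self.voted_for = self.pending.popleft()
            return self.voted_for    # send this candidate an OK
        return None
```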
Ties and Breaking Ties
So far, this approach is looking nice, but it does have a problem: ties. Imagine a case in which no processor gets a majority of the votes. Consider, for example, what would happen if each of three processors got 1/3 of the votes. Ouch!
Ties can, in fact, be broken at a somewhat high cost. If we use Lamport time with total ordering via hostid, no two messages will have concurrent time stamps. Messages that would otherwise be concurrent are ordered by hostid.
Recall that a host votes for a candidate as long as it has no outstanding votes. This becomes problematic if its vote turns out to be premature: it votes for one candidate, only to later receive a request, bearing an earlier timestamp, from another candidate.
At this point, one of two things might be occurring. The system might be making progress -- the "wrong" host might have gotten more than 50% of the votes. If this is the case, we don't care. It might not be fair, but it is an edge case.
Another possibility is that no host has yet received a majority of the votes. If this is the case, it could be because of deadlock. It might be that each candidate got the same number of votes. This is the case that requires mitigation.
So, upon discovering that it voted for the "wrong" candidate, a host needs to determine which of these two situations is the case. It sends an INQUIRE message to the candidate for which it voted. If this candidate won the election, it can just ignore the INQUIRE and RELEASE normally when done. But, if it hasn't yet entered the critical section, it gives back the vote and signals this by sending back a RELINQUISH. Upon receipt of the RELINQUISH, the voter is free to vote for the preceding request.
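The candidate's side of the INQUIRE/RELINQUISH exchange can be sketched as follows (simulation only; the `Candidate` class and its names are my own):

```python
class Candidate:
    """Candidate side of the INQUIRE/RELINQUISH exchange (sketch)."""
    def __init__(self):
        self.in_cs = False
        self.votes = 0

    def on_inquire(self):
        # A voter suspects its vote was premature. If we already won and
        # entered the CS, ignore the INQUIRE and RELEASE normally when done;
        # otherwise give the vote back so it can go to the earlier request.
        if self.in_cs:
            return None
        self.votes -= 1
        return 'RELINQUISH'
```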
Analysis and Looking Forward
This approach certainly has some nice attributes. It does, in fact, guarantee mutual exclusion. And it can allow a host to enter the critical section even if nearly half of the hosts are down or unreachable.
But it has non-trivial costs. Nominally, it takes 3 messages per entry to the critical section (request, vote, release), about the same as a timestamp approach. And, in the event that votes arrive in exactly the wrong order, an INQUIRE-RELINQUISH pair of messages can occur for each host.
What we need is a way to reduce the number of hosts involved in making decisions. This way, fewer hosts need to vote, and fewer hosts need to reorganize their votes in the event of a misvote.
Mutual Exclusion: Voting Districts
In order to reduce the number of messages required to win an election, we are going to organize the participating systems into voting districts called coteries (pronounced "koh-tarz" or "koh-tErz"), such that winning an election within a single district implies winning the election across all districts.
Coteries is a political term that suggests a closed, somewhat intimate, and conspiring collection of actors (persons, states, trade organizations, unions, &c), e.g. a "Boy's Club".
This can be accomplished by requiring that elections within any district be won by unanimous vote and then Gerrymandering each processor's district to ensure that all districts intersect. Since the subset of processors that are members of more than one district can't vote twice, they ensure that only one of the districts can gain a unanimous vote.
Gerrymandering is a term that was coined by Federalists in the Massachusetts election of 1812. Governor Elbridge Gerry, a Republican, won a very narrow victory over his Federalist rival in the election of 1810. In order to improve their party's chances in the election of 1812, he and his Republican conspirators in the legislature redrew the electoral districts in an attempt to concentrate much of the Federalist vote into very few districts, while creating narrow, but majority, Republican support in the others.
The resulting districts were very irregular in shape. One Federalist commented that one among the new districts looked like a salamander. Another among his cohorts corrected him and declared that it was, in fact, a "Gerrymander." The term Gerrymandering, used to describe the process of contriving political districts to affect the outcome of an election, was born.
Incidentally, it didn't work and the Republicans lost the election. He was subsequently appointed as Vice-President of the U.S. He served in that role for two years. Since that time both federal law and judge-made law have made Gerrymandering illegal.
The method of Gerrymandering districts that we'll study was developed by Maekawa and published in 1985. Using this method, processors are organized into a grid. Each processor's voting district contains all processors on the same row as the processor and all processors on the same column. That is to say, the voting district of a particular processor consists of all of those systems that form a perpendicular cross through the processor within the grid. Given N nodes, 2*SQRT(N) - 1 nodes will compose each voting district.
Using this approach, any pair of voting districts will intersect in at least one node, so two districts cannot be won unanimously at the same time.
The voting district of processor 7
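Computing a district is simple. The sketch below assumes N is a perfect square and that processors are numbered 0 through N-1 row by row (the function name and the numbering scheme are my own assumptions, not Maekawa's notation):

```python
from math import isqrt

def district(node, n):
    """Maekawa voting district for `node` in an n-node grid (n a perfect
    square): every node on its row plus every node on its column."""
    side = isqrt(n)
    row, col = divmod(node, side)
    members = {row * side + c for c in range(side)}     # the row
    members |= {r * side + col for r in range(side)}    # the column
    return members
```

For a 4x4 grid, processor 7's district is its row {4, 5, 6, 7} plus its column {3, 7, 11, 15} -- 2*4 - 1 = 7 nodes, and any two such crosses share at least one node.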
Here's what a node does, if it wants to enter the critical section:
- Send a REQUEST to every member of its district
- Wait until every member of its district votes YES
- Enter the critical section
- Upon exit from the CS, send RELEASE to each member of its district.
If a node gets a REQUEST, it does the following:
- If it has already voted in an outstanding election (it voted, but hasn't received a corresponding RELEASE), enqueue the request.
- Otherwise send YES
If a node gets a RELEASE:
- Dequeue the oldest request from its queue, if any, and send that node a YES vote.
As we saw with simple majority voting last class, this approach can deadlock if requests arrive in a different order at different voters. This can allow different voters within overlapping districts to vote for different candidates. In particular, it can allow for a "split" between the two voters that are the overlap between two districts.
Fortunately, we can use the same approach we discussed last class to recover from this situation if it becomes problematic:
- A node records Lamport time w/total ordering before it sends a request. It sends this time with the request to all members of its district (the same time).
- Each voter uses a priority queue based on the time of the request.
- If a node receives a request with a timestamp older than that of a request for which it has already voted, but for which it has not received a RELEASE, it attempts to cancel its vote. It does this by sending the candidate an INQUIRE.
If this candidate hasn't won the election, it forgets about the vote and sends the voter a RELINQUISH. Once the voter receives the RELINQUISH, it votes for the older request and enqueues the candidate for which it originally voted.
If the candidate was already in the CS, no harm was done -- deadlock did not actually occur. When the candidate exits, the voter can vote for the other candidate. In this case, the processors may not have entered the CS in FIFO order, but that's okay -- deadlock didn't happen.
This approach requires about 3*(2*SQRT(N) - 1) messages -- much nicer than 3*N messages. But it is not very fault tolerant, since a unanimous victory is required within a district. (Some failures can be tolerated, since failures outside of a district don't affect a node.)
At this point, we've considered several different ways of approaching mutual exclusion: a centralized approach, a couple of timestamp approaches, a voting approach, and voting districts. Another approach is to create a special message, known as a token, which represents the right to access the critical section, and to pass this around among the hosts. The host which is in possession of the token can access the shared resource -- the others cannot. Think of it as the key to the gas station's bathroom. Since there is only one key, mutual exclusion is ensured.
Token Ring Approach
The first among these techniques is perhaps the simplest -- and certainly among the most frequently used in practice: token ring.
With this approach, every system knows its successor. The token moves from system to system through the list. Each system holds the token until it is done with the CS, and then passes it to its successor.
We can add fault tolerance to this approach if every host knows the mapping for all systems in the ring. If a successor dies, then the successor's successor, and successor's successor's successor, and so on can be tried. A host assumes that a system has failed if it cannot accept the token.
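The successor-skipping rule can be sketched in a few lines. The function name and the explicit `alive` set are my own simplifications -- a real system would discover a dead successor by a failed or timed-out token hand-off, not by consulting a set:

```python
def next_holder(ring, holder, alive):
    """Pass the token to the next live successor (sketch of the
    fault-tolerant variant: skip nodes that cannot accept the token)."""
    i = ring.index(holder)
    for step in range(1, len(ring)):
        candidate = ring[(i + step) % len(ring)]
        if candidate in alive:
            return candidate
    return holder    # no live successor; keep the token
```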
What happens if a system dies with the token? If there is a known time-out period, the origin machine can regenerate the token and start circulating it again. Depending on the nature of the CS, this could be dangerous: if the original token turns out not to have been lost, multiple tokens now exist, and mutual exclusion can be violated.
The number of messages required per request is very interesting. Under high contention, the number is very, very low -- as low as one. If every system wants entry to the CS, each message will yield another entry. But if no one wants access to the CS, messages will occur for no reason.
But in general, we are more concerned about traffic when contention is high. That makes this algorithm particularly interesting. It is especially interesting in real-time systems, because the worst-case behavior is well-bounded and easily computed.
When possible, especially in distributed environments, which are inherently failure-prone, we don't want to give a user a permanent right to a resource. The user might die or become inaccessible, in which case the whole system stops.
Instead, we prefer to grant renewable leases with liberal terms. The basic idea is that we give the resource to the user only for a limited amount of time. Once this time has passed, the user needs to renew the lease in order to maintain access to the shared resource. Within the last ten years or so, almost all mutual exclusion and resource allocation systems have taken this approach, which is especially well suited for centralized approaches.
The amount of time for the lease should be long enough that it isn't affected by reasonable drift among synchronized physical clocks. But, it should be short enough that the time wasted after the end of the task and before the lease expires is minimal. It is also possible to allow the user to relinquish a lease early.
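The lease bookkeeping can be sketched as below. The `Lease` class is my own illustration; time is passed in explicitly so the expiry logic is deterministic and easy to test (a real system would consult loosely synchronized clocks):

```python
import time

class Lease:
    """Renewable lease (sketch): access is valid only until expiry."""
    def __init__(self, holder, term, now=None):
        self.holder = holder
        self.expires = (now if now is not None else time.time()) + term

    def valid(self, now=None):
        return (now if now is not None else time.time()) < self.expires

    def renew(self, term, now=None):
        # Only an unexpired lease can be renewed; an expired holder must
        # request the resource again from scratch.
        now = now if now is not None else time.time()
        if self.valid(now):
            self.expires = now + term
        return self.valid(now)
```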
The other problem is enforcement -- the user must be unable to access the resource after the lease expires. There are basically two ways of doing this: the leasing agent can tell the resource who the leasee is and what the term is, or the leasor can give the leasee a copy of the lease to present to the resource.
In either case, cryptography is needed to ensure that the parties are who they claim to be and that the lease's content is not altered. We'll discuss how this can be accomplished in more detail later. But, for now, let me just offer that it is often done using public key cryptography.
This can be used to authenticate the parties, such as the leasor, the leasee, or the resource, and it can also be used to make the lease unalterable by the leasee.
Token ring is very efficient under high contention, but very inefficient under low contention. Another way of approaching token-based mutual exclusion is to organize the hosts into a tree instead of a ring. This organization allows the token to travel from host to host, traversing far fewer unnecessary hosts.
Raymond's algorithm is one such approach. It organizes all of the nodes into an unrooted n-ary tree. When the system is initialized, one node is given the token -- the privilege to enter the critical section. It may or may not need the privilege -- but someone needs to have it. The other nodes are organized so that they form a tree. The edges of this tree are directional -- they must always point in the direction of the token.
The figure below shows an example of a tree initialized for Raymond's algorithm. Please note that I have drawn it as a binary tree as a force of habit -- this is not necessary. Another way of describing an "unrooted n-ary tree" is as "a connected graph without cycles".
Let's trace the execution of the algorithm through several requests for and releases of the critical section. Let's begin by assuming that things are as they appear in the figure above and that host 1 is within the critical section.
Given these circumstances, the figure below depicts the state of the system after host 7 requests entry into the critical section. Process 7 enqueues its own request and then sends a request to host 3. Process 3 enqueues host 7's request and makes a request as a proxy to host 1, which in turn enqueues host 3's request.
Now let's repeat the process above, this time for a request from host 2.
Now, let's see what happens when host 6 requests entry into the critical section. The big difference in this case is that host 3's queue is not empty. Since host 3 has already requested the token, it will not request the token again in response to host 6's request.
Now, let's assume that host 1 exits the critical section. It will dequeue the first host from its queue of requestors and send the token to it. It will then set its current_direction pointer to point to this node. Once the destination host gets the token, it will set its current_direction pointer to null, effectively changing the direction of the edge. Since host 1's queue of requests is not empty, it will send a message to host 3 requesting the token. This is necessary to ensure that it can satisfy the enqueued requests.
Once host 3 gets the token, it will dequeue the head of its request queue. Since the requestor is host 7, not itself, it will send the token to host 7 and set its current_direction pointer to point to host 7. Since its queue is still not empty, it will send a request to host 7 to ensure that it can satisfy its remaining enqueued requests.
When host 7 finishes with the critical section, it will send the token to host 3, which will in turn send the token to host 6. Host 3 will also make a request to host 6. This will ensure that it gets the token back and can satisfy host 1's request.
Once Host 6 is done with the critical section, the next chain of events will send back the request chain through hosts 3 and 1 to host 2, which will enter the critical section:
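The hand-offs in the walkthrough above can be sketched in code. This is a simplified, synchronous simulation in which messages become direct method calls; the class and method names are my own, not Raymond's notation:

```python
from collections import deque

class RaymondNode:
    """Participant in Raymond's token-tree algorithm (sketch)."""
    def __init__(self, node_id):
        self.id = node_id
        self.parent = None       # edge toward the token; None => we hold it
        self.queue = deque()     # pending requestors, possibly including self
        self.in_cs = False

    def request_cs(self):
        self._handle_request(self)

    def on_request(self, child):     # REQUEST from a neighbor below us
        self._handle_request(child)

    def _handle_request(self, requestor):
        if self.parent is None and not self.in_cs and not self.queue:
            # We hold an idle token: grant (or enter) immediately.
            self.queue.append(requestor)
            self._grant_next()
        else:
            # Only the first pending request triggers a proxy REQUEST
            # toward the token; a second one would be redundant.
            if self.parent is not None and not self.queue:
                self.parent.on_request(self)
            self.queue.append(requestor)

    def on_token(self):              # the token arrives from our old parent
        self.parent = None
        self._grant_next()

    def _grant_next(self):
        head = self.queue.popleft()
        if head is self:
            self.in_cs = True        # enter the critical section
        else:
            self.parent = head       # the edge now points at the token
            head.on_token()
            if self.queue:           # still have waiters: ask for it back
                self.parent.on_request(self)

    def release_cs(self):
        self.in_cs = False
        if self.queue:
            self._grant_next()
```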
One interesting thing to note about the execution of Raymond's algorithm is that requests are not necessarily satisfied in the same order in which they were made. In the example above, the requests are made in the following order: 7, 2, 6, but granted in the order: 7, 6, 2.
The first time I looked at this algorithm and noticed this property, I became very concerned. The little robot over my right shoulder was yelling, "Starvation! Starvation!" I realized that the system does not have a global queue. It has a collection of local queues. These local queues are organized using the tree's current_direction relationship, not FCFS. The result is that local requests are given some preference over distant requests. Surely starvation must be possible.
Well, not exactly. While the algorithm isn't fair, starvation cannot occur. The algorithm does guarantee a maximum path length between a requestor and the holder. The result is that starvation is not possible. If you are interested in the proof of this, please drop by -- I'm happy to go over it with you -- as always, we're here to help.
Raymond's approach limits the number of messages required to obtain access to the critical section by only communicating with those nodes along the path between the token and the requestor in the tree, and furthermore by terminating the request chain as soon as it would begin to overlap another prior request and as a consequence have no impact.
But, in the worst case, a message could still have to travel all of the way up and down the tree, and the token all of the way back in the reverse direction -- even if no intermediate nodes need it. It would be nice if we could simply send the token to the requestor, without the rest of the scenery.
Path compression, which was originally developed by Li and Hudak for use with distributed shared memory, allows for this type of "short cut". It is based on a queue of pending requests. This queue is maintained implicitly by two different types of edges among the nodes: current_dir and next.
Each node's current_dir edge leads to its best guess of the node that is at "the end of the line" of hosts waiting for access to the critical section. The node "at the end of the line" has this pointer set to itself.
The next edge is only valid for those nodes that either have the token or have requested the token. If there is a next edge from node A to node B, this indicates that node A will pass the token to node B, once it has exited the critical section. Nodes which have not requested the critical section, or have no requests enqueued after them, have a null next pointer.
In this way, the next pointer forms the queue of requests. The next pointer from the token holder to the next node forms the head of the list. The next pointer from that point forward indicates the order in which the nodes that have made requests for the critical section will get the token.
The current edge of each node points to that node's best guess about the last node in the queue maintained by the next edges. Current edges may be out of date. This is because a node may not be aware of the fact that additional nodes have been enqueued. But this is okay. A request can follow the current edge to the "old" end of the queue. This node will in turn lead to a node farther back in the list. Eventually, the request will come to the end of the list.
Once a request reaches the end of the current edge chain, it will also be at the back of the queue maintained by the next pointers, so it can "get in line" and take its place at the end of the queue.
How do the current edges get updated? Well, as a request is percolating through the nodes via the current edges, each node's current edge is adjusted to point to the requesting node. Why? Well, the requesting node is the most recent request and is (or will soon be) at the end of the request queue.
Let's walk through an example. Let's begin with some unrooted tree of nodes, giving one the token. The other nodes have current edges that eventually lead to the token holder. All of the next pointers are null, since no requests are enqueued.
Now, let's see what happens if node 6 requests access to the critical section. It will forward its request to node 4, and then set its own current_dir edge to itself to indicate that it is at the end of the list. Node 4 will forward the request to node 3. Both node 4 and node 3 will reset their current_dir edges to point to node 6. Since node 3 has the token and its next edge is null, it will set its next edge to point to node 6. This indicates that it will give the token to node 6 once it is done with the critical section.
If node 2 makes a request for the critical section, its request will propagate through node 3 to the end of the current_dir chain. Nodes 3 and 6 will adjust their current_dir pointers. Node 6 will also adjust its next edge from null to point to node 2. Please note that the next edges are forming the queue of nodes requesting the critical section and that the current_dir pointers are being updated as messages propagate through their origin nodes.
At this point, let's see what happens when node 3 leaves the critical section. It will pass the token to node 6, and set its own next pointer to null. Its understanding of the tail of the queue hasn't changed, so its current_dir edge does not change.
The same process will be followed when node 6 exits the critical section and passes the token to node 2:
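The same walkthrough can be sketched in code. As before, this is a synchronous simulation with my own names, where message passing becomes a direct method call:

```python
class PCNode:
    """Path-compression participant, after Li & Hudak (sketch)."""
    def __init__(self, node_id):
        self.id = node_id
        self.current_dir = self   # best guess at the tail of the queue
        self.next = None          # who we hand the token to on release
        self.has_token = False

    def request_cs(self):
        self._forward(self)

    def _forward(self, requestor):
        # Each node the request passes through repoints its current_dir
        # at the requestor -- the new (or soon-to-be) tail of the queue.
        target = self.current_dir
        self.current_dir = requestor
        if target is self:
            self.next = requestor      # we were the tail: enqueue behind us
        else:
            target._forward(requestor) # keep chasing the tail

    def release_cs(self):
        self.has_token = False
        if self.next is not None:
            self.next.has_token = True # pass the token down the next queue
            self.next = None
```

Replaying the example: with node 3 holding the token and current_dir chains 6 -> 4 -> 3 and 2 -> 3, node 6's request sets 3's next edge to 6, and node 2's request then sets 6's next edge to 2 -- the next edges form exactly the queue described above.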
It should be relatively straightforward to see that this approach reduces the number of messages per request with respect to Raymond's approach -- we only passed the token to an actual requestor, never as a simple matter of an intermediate hop.