CSE 120 Discussion Notes: Week 4

Advanced Synchronization Techniques, or, Do We Need Locks?

Semaphores, locks (spinlocks and mutexes), and condition variables are the standard tools used to write safe multithreaded code. You'll see them if you look inside a typical modern operating system kernel designed for multi-processor (SMP) systems.

But there are times when these standard synchronization mechanisms don't quite cut it. Perhaps it's a piece of performance-critical code, or code which has trouble scaling to large numbers of processors because of lock use. If you dig around in Linux and other systems, you'll find some other clever approaches to synchronization, too.

First a disclaimer: You probably shouldn't be using these methods in your code, unless you understand the issues involved, have good reasons to avoid standard synchronization primitives, and are willing to write some architecture-specific code. Even then, you should think twice. But they can be fun to learn about.

Lock-Free Data Structures with CMPXCHG

Remember transactional memory from discussion last week? There, we were able to write programs without any explicit locking—only marking code blocks that we wanted to be atomic—and the processor would take care of the details for us. Unfortunately, there aren't any processors you can buy that will do this now. However, we may be able to get some of the same performance benefits—not having to use explicit locks—in restricted cases by doing a lot of the work ourselves in software.

Here's a simple example: we'll implement a linked list and give it stack operations: push an item onto the list and pop an item off the list. We'd like multiple threads to be able to access the list concurrently, without the need for locks.

The data structure is simple and what you'd expect:

struct item {
    int value;
    struct item *next;
};

struct item *stack_top = NULL;

Here is how we might implement the push and pop operations for a single-threaded program:

void push(int value)
{
    struct item *node = new item;
    node->value = value;
    node->next = stack_top;
    stack_top = node;
}

int pop()
{
    struct item *node = stack_top;
    if (node == NULL)
        return -1;    // Signal the stack is empty

    stack_top = node->next;

    int value = node->value;
    delete node;
    return value;
}

This code won't work as written if multiple threads may be calling push or pop simultaneously. To implement it without locks, we'll rely on the compare-and-exchange (or compare-and-swap) operation. The behavior of compare-and-exchange is essentially

int cmpxchg(int source, int expected, int *dest)
{
    int old_value = *dest;
    if (old_value == expected)
        *dest = source;
    return old_value;
}
except that the code executes atomically. This would be implemented as a single instruction in the processor. In words, compare-and-exchange compares the value at a memory address with an expected value, and if the two are equal, it writes a third value into memory. If the values don't agree, it does nothing. In either case, it will return the value which was actually found in memory. (So here, if the return value is equal to expected, then we know the write to memory occurred.)

There are different variants of compare-and-exchange that might be available, depending upon the system; in particular, the stack code below assumes a variant that operates on pointer-sized values. The key idea is that we have a single atomic instruction which can check a value in memory against an expected value, write a different value if it matches, and let us know whether it succeeded or not.

How can we use this in our linked-list example? We'll use cmpxchg to actually update the stack_top variable. If another thread tries to update simultaneously, the thread that goes second will notice the problem, go back, and retry the operation. So in the push function, we'll get the new item ready for the front of the list, and then atomically try to put it into place. If stack_top changed while we were doing this, we'll have to fix up the next pointer on the new item, and then retry. pop works similarly.

void push(int value)
{
    struct item *node = new item;

    node->value = value;
    struct item *old_stack_top;
    do {
        old_stack_top = stack_top;
        node->next = old_stack_top;
    } while (cmpxchg(node, old_stack_top, &stack_top) != old_stack_top);
}

int pop()
{
    struct item *node;
    do {
        node = stack_top;
        if (node == NULL)
            return -1;    // Signal the stack is empty
    } while (cmpxchg(node->next, node, &stack_top) != node);

    int value = node->value;
    delete node;
    return value;
}

(Actually, there's still a potential small race condition in the code for pop as things are written. Can you spot it? What happens if two threads execute pop nearly simultaneously? What effect might the delete node line in the first thread have? Hold this thought for the next section.)

Though it is simple, compare-and-exchange is powerful enough that it can be used to implement a wide variety of lock-free data structures. It is more powerful than test-and-set: while test-and-set suffices to implement locks, it is not sufficient for implementing all of these lock-free data structures.

Read-Copy-Update (RCU)

Suppose that you have a data structure which is read by very many threads, often concurrently, but is only very rarely written to. For accesses to be safe, you'd ordinarily need to use a lock. Since most threads are reading, we could use a reader-writer lock to allow more concurrency. But if most threads access the data only very briefly, even this will not help much. (Why? Consider our implementation of a reader-writer lock from lecture, and think about what will happen if many, many threads are all trying to acquire a read lock at the same time.)

The Linux kernel supports a technique called Read-Copy-Update, or RCU, for cases such as these. By observing a few restrictions, RCU allows all the readers to access the data concurrently, without ever having to acquire any locks!

For a concrete example: suppose our kernel is keeping track of the date and time:

struct datetime {
    int year, month, day;
    int hour, minute, second;
};

struct datetime now;

We set up a timer to update now once per second, so that it always contains the current time. What could happen if another piece of code tries to print the date, and is interrupted by the timer after it has read the year, but before the month and day?

What we'd like is for readers to be able to see a consistent version of the data, without having to use a lock to synchronize access. RCU solves this problem by:

  1. Replace now with a pointer to a datetime object.
  2. To read, first save a copy of the now pointer. Then, use that pointer to access all the data values needed.
  3. In the timer interrupt, do not overwrite the fields pointed to by now. Instead, create a copy of the object, and make all changes to the copy.
  4. When the copy is initialized, update the pointer to refer to the new object. Since this only needs to write a single pointer, it happens atomically.
  5. Finally, at some point in the future, when the system knows that no readers are still using the old version, that object can be deleted.

How does the system know when no thread is still using the old object? One possibility, if the kernel scheduler is not pre-emptive, is to establish the rule that a thread cannot start to read an RCU object, yield the CPU (perhaps by going to sleep or blocking on a lock), and then finish reading the RCU object. Once all processors in the system have performed a context switch, it is safe to delete old objects. (Why?)

Can you think of other mechanisms that could be used to decide when old objects can be reclaimed?

If you'd like to read more, see the Wikipedia article on RCU, which also contains links to other references.

Questions to Consider

If we have time, we may discuss some of these questions in discussion. Feel free to think about them ahead of time, though don't worry if you don't have the chance.

  1. The Unix C library was mostly developed before threading, and so many functions in it are not thread-safe. What does this mean? What types of problems can come up with using the global variable errno to report errors? How might you fix these problems?
  2. Early versions of Java made it nearly impossible to write event-driven servers, because I/O was blocking. What effect might this have on the number of threads in a Java server? In light of this, why do you think Java implementations may have largely avoided a 1-to-1 mapping of Java threads onto kernel threads?
  3. In the Java language, each object gets a single combined lock/condition variable. In Nachos, we can allocate any number of locks we want per object, and we can have any number of condition variables per lock. What are the advantages and disadvantages of each approach?
  4. We've said that semaphores are equivalent to locks and condition variables, in that any problem which can be solved with one can be solved with the other. Help prove this by showing how semaphores can be implemented with locks and condition variables.
  5. Hoare and Mesa monitors have different semantics. With Hoare semantics, calling Signal() transfers control immediately to one of the waiting threads, with no chance for a thread outside the monitor to enter first. Mesa makes no such guarantees. Suppose we are using Mesa-style semantics for our locks and condition variables, but we wish to make sure one of the threads waiting on the condition variable is given a chance to run first when we call Signal(). How might we implement this? (For simplicity, you only have to think about the case where the first thread calls Signal() as the last operation before releasing the lock—that is, the original thread will not need to reacquire the lock after control is transferred to the waiting thread.)