Shared memory designs fall into two major categories, depending on whether the access time to shared memory is uniform or non-uniform. These machines are referred to as UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access), respectively. With UMA, the cost of accessing main memory is the same for all memory addresses in the absence of contention. UMA designs are often called flat shared memory, and the machines that are built on top of such memory are called Symmetric Multiprocessors. (To learn more about SMPs, see the web page on Commercial Symmetric Multiprocessors.)
Processors inevitably contend for memory, i.e. they access the same memory module or even the same location simultaneously. When severe, contention effectively serializes memory accesses, which no longer proceed in parallel but one at a time. Though high-end servers scale to many tens of processors, e.g. the IBM 'p' series to 64 cores and Sun Fire E servers to 72 cores, the interconnect required by such high-end designs would be too costly for larger configurations today. An alternative is to employ physically distributed memory, i.e. a NUMA architecture. In a NUMA architecture, memory access times are non-uniform: a processor sees different access times to memory, depending on whether the access is local or not, and if not, on the distance to the target memory. Access to remote memory owned by another processor is more expensive. Complex hierarchies are possible, and memory access times can be highly non-uniform. Large-scale designs with up to 1024 processors are commercially available today. It is clear that the term "distributed memory" must be properly qualified, since it arises in both shared memory and message passing architectures.
In addition to the above varieties of shared memory, we also have a "get/put" model, which is supported by the Cray T3E. This model can be viewed as a hybrid of message passing and shared memory. Each processor can directly access remote (non-local) memory, though it must explicitly designate which processor's memory it will access (with true shared memory, such mapping is handled transparently by the hardware). We call this type of communication "one-sided," since only one processor needs to be involved in the communication. Ordinary message passing is two-sided in the sense that both sender and receiver must be aware of one another. MPI-2 supports one-sided communication.
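As a concrete illustration, here is a minimal sketch of the one-sided style using MPI-2 remote memory access: rank 0 "puts" a value directly into rank 1's exposed window, and rank 1 never posts a matching receive. It is offered only as an illustration of one-sided communication in general; the Cray T3E's native get/put interface (SHMEM) is a different library.

    /* Minimal sketch of one-sided communication with MPI-2 RMA. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank exposes one int to remote access. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);            /* open the access epoch */
        if (rank == 0) {
            int val = 42;
            /* One-sided: only rank 0 names the target (rank 1). */
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);            /* close the epoch; data now visible */

        if (rank == 1)
            printf("rank 1 sees buf = %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }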
While cc-NUMA architectures add specialized support for shared memory, e.g. coherence control, they still rely on fine-grained message passing involving short messages, as do one-sided architectures. So it appears that designs are converging, with the important details handled through a combination of software and specialized hardware support. Message-based communication is a unifying model, and may be used in part to understand computer performance. Distributed memory designs are also converging in another sense: interconnection topology (at least within a single machine) is much less of an issue than it was in the early days of parallel computing, and programmers may often assume that nodes can communicate with one another at a uniform rate.
In a shared memory parallel computer all processors have direct access to all of memory. This shared memory abstraction supports automatic (hardware-assisted) address mapping and is usually built on top of fine-grain message passing involving short messages. Unlike message passing, communication with shared memory is anonymous: there is no explicit recipient of a shared memory access, and processors may communicate without necessarily being explicitly aware of one another.
Owing to the use of cache memories in modern computer architectures, shared memory introduces the cache coherence problem. Cache coherence arises when shared data is both written and read: if one processor modifies a shared cached value, then the other processor(s) must eventually see the latest value, and all processors must eventually agree on the updated value. Otherwise race conditions will arise, resulting in non-deterministic behavior. It is desirable for caches to be coherent, and NUMA architectures that provide this guarantee are called cc-NUMA (cache coherent NUMA). Under some circumstances we may tolerate periods of incoherence (to improve performance) so long as we guarantee coherence at certain well-defined times. This is referred to as relaxed consistency.
While cache coherence is a necessary condition for ensuring correctness, it is not a sufficient one. Coherence says nothing about when changes propagate through the memory subsystem, only that they will eventually happen. Other steps must be taken (usually in software) to avoid race conditions that could lead to non-deterministic program behavior.
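To make the point concrete, here is a minimal sketch, assuming POSIX threads, of a race on a shared counter and its repair with mutual exclusion. Coherence guarantees that each read sees some recent value of the counter, but the read-modify-write is not atomic, so without the lock updates can be lost.

    /* Two threads increment a shared counter.  With the mutex the final
       count is 2,000,000; with the racy (commented out) version it is
       non-deterministic because increments can be lost. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            /* Racy version:   counter++;           */
            /* Correct version (mutual exclusion):  */
            pthread_mutex_lock(&counter_lock);
            counter++;
            pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (2000000 expected)\n", counter);
        return 0;
    }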
Formally, we say that a memory system is coherent if the following three conditions hold:

1. A read by processor P of location X that follows a write by P to X, with no intervening writes to X by another processor, returns the value written by P.
2. A read by one processor of location X that follows a write to X by another processor returns the written value, provided the read and write are sufficiently separated in time and no other writes to X occur in between.
3. Writes to the same location are serialized: any two writes to X are seen in the same order by all processors.
Under certain circumstances we can tolerate periods of data incoherence (to improve performance) so long as we guarantee coherence at certain well-defined times. Just when these times occur is determined by the memory consistency model, which determines when a written value will be seen by a reader. With Sequential Consistency, the execution on a parallel architecture is consistent with the sequential execution of some interleaved arrangement of the separate concurrent instruction streams. One way of providing sequential consistency is to stall the processor on a write. However, this stall-on-write policy is expensive. (And note that it does not fix an incorrect program; we still require mutual exclusion.)
Weaker consistency models are used to improve performance, and are useful, for example, in dealing with false sharing, in which two processors write to different parts of a cache line without actually interfering with one another. It may be necessary to maintain coherence only at certain times in the program. (This is typically handled under software control.)
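The following minimal pthreads sketch exhibits false sharing: each thread updates only its own counter, yet without the padding both counters can land in the same cache line (a 64-byte line size is assumed here), and the line will ping-pong between the two caches even though there is no true sharing.

    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64   /* assumed line size */

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* remove pad to observe false sharing */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < 100000000; i++)
            counters[id].value++;             /* each thread touches only its own slot */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (long id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }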
Note that with a non-causal consistency model, it is possible for processor P2 to observe A == 1 and then set B = 1, while processor P3 observes B == 1 and yet still reads A == 0 (so C receives the stale value 0)!
P1         P2               P3
A = 1      if (A == 1)      if (B == 1)
               B = 1            C = A
There are two major strategies for managing coherence: snooping protocols, which rely on broadcasting writes so that every cache can observe them, and directory-based protocols, which keep a record of which processors hold a copy of each block.
The simplest shared memory designs employ a bus and maintain coherence with a snooping protocol: every cache monitors (snoops on) the bus, and when a processor P1 writes a shared block, the other caches holding that block either invalidate or update their copies. In either case, P1 must wait for an acknowledgment from P2 before continuing with the subsequent instructions.
The trouble with the snooping protocol is that it relies on an efficient broadcasting capability. While broadcasting is efficient on a bus, buses do not scale. Modern designs instead use an interconnection network.
Consider the following scenario. P1 sends a request to P2 to furnish some data. The full request-reply time can be quite long, depending on the state of memory. What we would like to do is reduce memory latency, or else tolerate it by overlapping it with processing activity.
The Stanford DASH computer provides support for such activity via a hardware directory. DASH stands for "Directory Architecture for SHared memory." The three goals of this machine are
Thus, memory can be inconsistent with respect to a remotely cached copy, but the memory system is aware of this inconsistency and can take remedial action. The key to using the directory is that the owner of a block (called the home) must be consulted on a write or on a remote access.
Each directory entry contains a p-element bit vector recording which processors have a shared cached copy of the block (here we show a simple 4-processor implementation). It also has additional bits indicating whether the block is cached and whether it is dirty; if dirty, the PID of the owner is also recorded.
 P0  P1  P2  P3
+---+---+---+---+---+---+---+
| X | X | X | X |   |   |   |<--- owner PID
+---+---+---+---+---+---+---+
{------ 4 ------}  ^   ^
 All above X's     |   |
 0 or 1            |   +-- Dirty
                 Cached
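One way to picture this entry in code is the following minimal C sketch of a 4-processor directory entry. The field names are illustrative only, not taken from DASH.

    #include <stdint.h>

    struct dir_entry {
        uint8_t sharers;    /* 4-bit presence vector: bit p set if
                               processor p holds a shared cached copy  */
        uint8_t cached : 1; /* is the block cached anywhere?           */
        uint8_t dirty  : 1; /* is some cache's copy newer than memory? */
        uint8_t owner  : 2; /* PID of the owning processor, valid only
                               when dirty is set                       */
    };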
Consider the case where location A is owned by processor P1 (Store A, #1). Now, when P2 sets A to 2 with a Store A, #2, it sends P1 a message telling P1 to mark A as "dirty" (step 1) and to issue invalidations of any outstanding shared copies, as indicated in the directory entry. No further requests on the particular block are allowed until all invalidating processors return an acknowledgment to P1, indicating that they have completed the invalidation. If any processor attempts to access A, it will be locked out until P1 has received all acknowledgments.
Store A, #1
  step 0 |              Mark A dirty
         v                 Step 1
       (P1)<----------------(P2)   Store A, #2
         ^  ---------------->
         | step 2a              | Step 2b
         |                      |
       (P3)   Load A <----------+
When P1 has received all the acknowledgments, it also sets the PID field of A's block to indicate that P2 is the owner. When P3 tries to access A (step 2a), P1 forwards the request to P2, because the directory's PID field identifies P2 as the owner. P2 replies with the data directly, short-circuiting P1. P1 also marks P3 as having a cached copy of the block.
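The sequence above can be summarized as a simplified sketch of the home node's write handling. The helper routines (lock_block, send_invalidate, wait_for_all_acks) are hypothetical placeholders, and the whole routine is a rough illustration rather than DASH's actual protocol engine.

    #define NPROCS 4

    /* Simplified version of the directory entry sketched above. */
    struct dir_entry { unsigned sharers, cached, dirty, owner; };

    /* Hypothetical helpers assumed to exist in the protocol engine. */
    void lock_block(int block);
    void unlock_block(int block);
    void send_invalidate(int pid, int block);
    void wait_for_all_acks(int block);

    /* Home node's handling of a write request from writer_pid. */
    void home_handle_write(struct dir_entry *e, int writer_pid, int block)
    {
        lock_block(block);                    /* no further requests on this block   */

        /* Step 1: invalidate every outstanding shared copy. */
        for (int p = 0; p < NPROCS; p++)
            if ((e->sharers & (1u << p)) && p != writer_pid)
                send_invalidate(p, block);

        wait_for_all_acks(block);             /* each sharer must acknowledge        */

        e->sharers = 1u << writer_pid;        /* only the writer now holds the block */
        e->dirty   = 1;
        e->owner   = (unsigned)writer_pid;    /* future requests are forwarded here  */

        unlock_block(block);
    }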
Note that a remote cache miss can be expensive due to forwarding, though there will be at most one forwarding agent. Invalidation can be expensive because an acknowledgment must be received from every processor that has a copy. However, this is not a common occurrence, and when it does happen usually only a few processors need to receive invalidation signals. (Comparative memory timings for cache miss, local miss, and remote miss are in the ratio of 2:10:50.)
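As a rough illustration of what this ratio implies (using hypothetical access rates, not measured data): if 90% of references cost 2 units, 8% cost 10 units, and 2% cost 50 units, the average cost is 0.90 x 2 + 0.08 x 10 + 0.02 x 50 = 3.6 units, so the 2% of remote misses account for nearly a third of the total memory time.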
Release Consistency. To reduce the cost of this overhead, DASH supports release consistency. Under release consistency, enforcement of cache consistency is suspended until the programmer tells the machine to settle all outstanding inconsistencies (cache values, dirty/clean bits, and so on). This permits the programmer to reduce the cost of enforcing consistency, though at the expense of possibly introducing an error; in effect, the programmer must manage the cache. In some cases, such treatment can dramatically improve performance. (See the work by the Wisconsin Wind Tunnel Project.)
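At the programming level, the idea behind release consistency resembles the acquire/release ordering available through C11 atomics: ordinary writes may propagate lazily, and visibility is only guaranteed at an explicit synchronization (release) point. The following minimal sketch illustrates that language-level analogy only; it is not DASH's hardware mechanism.

    /* The release store on `ready` is the point at which all of the
       producer's earlier writes (here, `payload`) are guaranteed to be
       visible to a consumer that performs a matching acquire load. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;              /* ordinary shared data */
    static atomic_int ready = 0;     /* synchronization flag */

    static void *producer(void *arg)
    {
        payload = 42;                                      /* plain write     */
        atomic_store_explicit(&ready, 1,
                              memory_order_release);       /* "release" point */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        while (atomic_load_explicit(&ready,
                                    memory_order_acquire) == 0)
            ;                                              /* spin until released */
        printf("payload = %d\n", payload);                 /* guaranteed to be 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }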
DASH stalls the processor on a cache miss. However, later designs allow out-of-order instruction issue, so that register-to-register instructions that do not depend on the unresolved memory access can continue.
Copyright © 2008 Scott B. Baden. Last modified: Mon Feb 4 20:35:39 PST 2008