Jamison Collins (indecent_and_obscene@yahoo.com)
Tue, 30 May 2000 04:14:04 -0700 (PDT)

This paper discusses methods to implement distributed
shared virtual memory on loosely
clustered workstations. Supporting this type of
memory system would enable program
parallelization with reduced program rewriting, as
well as supporting a more dynamic
communication model--in a message passing system,
without extensive coding, the communication
patterns are fixed regardless of the details of the
dataset. There had been little previous
work in this area--most work targetted shared bus

This paper presents several approaches to maintaining
coherency in the face of reads and
writes. This is examined for both a single
centralized manager and a distributed manager.
Through mathematical evidence, a distributed approach
is determined to be the best.

To validate these approaches, 4 microbenchmarks were
analyzed. While perhaps not true
microbenchmarks, the programs were a far cry from try
scientific computing applications.
And, as expected, there is one benchmark which shows
the "surprising" superlinear speedup,
one which does very poorly, and two which do well(just
kidding... I know this is early
work :P). The superlinear speedup is attributed to a
vast reduction in the number of
pagefaults due to the aggragate increase in the amount
of available ram throughout the NOW.

The main question i ahve about this paper is about the
performance of the network links
that were used. I assume that is covered in infamouse
reference [34]. But I would assume
that the latency was very high in order to justify the
granularity of sharing being on
1k boundaries.

Implementing Global Memory

Advances in network and processor technology have
changed LANs. We now have the capability
of viewing collections of machines, while loosely
coupled, more cooperatively. As was seen
in the previous paper, the user of remote memory to
minimize page faults can have very
beneficial effects on overall performance--this paper
explores intentionally exploiting
this effect through the idle memory of machines in a

Memory present on a node can be in two primary
states--local or global. Local memory is
memory in use by applications present on the node and
global memory is memory that is being
temporarily stored on the node which belongs to an
application on a different node.
The paper then discusses the exact implementation of
memory management strategy that is used.
THe kernel was modified as was the TLB handler to
provide the necessary information to the
algorithm. Specifically, LRU information is difficult
to obtain on the specific processors
that were used.

The approach is evaluated through 3 benchmarks. The
results show that performance is grealty
improved through this memory structure, by as much as
a factor of 3. Additionally, the
modifications were found, in general, to have minimal
negative impact on the network. The
one exception to this is when the idle memory on one
node is heavily contended for, the
cpu usage on that node greatly increases. Of course,
the aggregate performance still greatly
improves, but performance on that single node will be

I hadn't realized that an approach such as this would
yield such large performance gains.
It is a great demonstration that a dynamic method for
resource distribution can
significantly outperform any static partition. One
thing I'd be curious to see would be
to try to setup some sort of head to head setup
between an Origin 2000 or similar machine
and a NoW as described in this paper.

I have often wondered... what if we discovered
that we were threatened... by an outer power,
from outer space... from another planet?
Wouldn't we all of a sudden find that we
didn't have any differences between us at all?
That we were all human beings?

Do You Yahoo!?
Kick off your party with Yahoo! Invites.