Processor Allocation and Process Migration in Distributed Systems
Let's talk about scheduling in a distributed system. Our discussion is going to involve picking a processor to run a process and moving that process from one processor to another when appropriate. Processor allocation involves deciding which processor should be assigned to a newly created process and, as a consequence, which system should initially host the process.
In our discussion of process migration, we will discuss the costs associated with moving a process, how to decide that a process should be migrated, how to select a new host for a process, and how to make the resources originally located at one host available at another host.
Although these algorithms will be discussed in the context of processes, tasks with only one thread, they apply almost unaltered to tasks containing multiple threads. The reason for this is that the interaction of the multiple threads with each other and the environment almost certainly implies that the entire task, including all of its threads, should be migrated whole -- just like a single-thread process. Unlike on multiprocessor computers, it only very rarely makes sense to dispatch different threads to different processors, or to migrate some threads but not others. In distributed systems, the cost of sharing resources on different hosts is usually far too high to allow for this level of independence.
An Introduction to Processor Allocation
One interesting aspect of distributed systems is that we can choose upon which processor to dispatch a job. This decision, and the associated action of dispatching the job onto the processor, is known as processor allocation.
Depending on the environment, different factors may drive our decision. For example, many environments consist largely of networks of (personal) workstations (NOWs). In these environments, it may be advantageous to "steal" cycles from other users while they are away from their machines, leaving them idle. This is especially attractive since some studies have shown that the typical workstation is idle approximately 70-80% of the time. Of course we would only want to do this if our own workstation is substantially busy -- otherwise we would be paying the price of shipping our job, and perhaps user interaction, etc., both ways and gaining little or nothing. It is certainly foreseeable that the unnecessary use of a remote processor can increase (worsen) turnaround time.
In other cases, we may have pools of available "cycle servers" and underpowered personal workstations. If this is the case, we can organize our system so that it always dispatches jobs to a remote processor. But, in either case, we want to make sure that we make a careful choice about where we should send our work -- otherwise some poor machine may get smashed.
Transparent processor allocation is different from simple remote execution, as might be provided by something like rsh. The biggest difference is in transparency -- the user need not know that the job is executing anywhere other than locally. The second difference is that processor allocation can lead to migration, or the movement of a job after it has begun executing.
A Centralized Approach: Up-Down (Mutka & Livny '87)
The first technique that we'll talk about is named Up-Down. The Up-Down approach tries to be somewhat fair to users in the way that it allocates processors. It does this by giving lightweight users priority over "CPU hogs". Users earn points when their workstation is idle. This is, in effect, credit for allowing others to use their processor. Users lose points when they consume the idle CPU on remote hosts. The points are accrued or spent at a fixed rate. This approach assumes that a user is associated with exactly one workstation (his or her workstation).
When a processor becomes available, the system gives it to the requestor with the greatest number of points. This favors those users who are net providers of CPU and penalizes those who are net users. In effect, it ensures that if you only need processor time occasionally, you get it right away. But, if you have used tons of CPU, you yield to those who have been less demanding recently.
Obviously it is impossible for everyone to be a net consumer of CPU, since one can't use more CPU than is available. Everyone can, however, be a net supplier of CPU, since it is possible for all hosts to be idle.
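To make the bookkeeping concrete, here is a minimal sketch of the point accounting described above. The names (User, tick, grant) and the two rate constants are assumptions made for illustration; they are not taken from the Mutka & Livny paper.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    points: float = 0.0    # credit balance (may go negative)
    idle: bool = False     # is this user's own workstation idle?
    remote_cpus: int = 0   # remote processors this user currently occupies


EARN_RATE = 1.0   # assumed: points earned per tick while own workstation is idle
SPEND_RATE = 1.0  # assumed: points spent per tick per remote processor in use


def tick(users):
    """Advance the accounting by one time unit, at fixed rates."""
    for u in users:
        if u.idle:
            u.points += EARN_RATE
        u.points -= SPEND_RATE * u.remote_cpus


def grant(requestors):
    """Give a newly available processor to the requestor with the most points."""
    winner = max(requestors, key=lambda u: u.points)
    winner.remote_cpus += 1
    return winner
```

A user whose workstation sits idle builds up credit and wins the next grant, while a user already holding several remote processors falls steadily behind.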
It requires some overhead to determine where a process should be run. This overhead can grow quite large, especially if many machines are assumed to be busy. An alternative to the approach described above is to use a hierarchical approach. Instead of assuming the peer-to-peer workstation model as we did above, we are going to assume a model with many "worker" workstations, and a smaller number of "manager" workstations.
We organize these workstations into a tree, with all of the workers as leaves. The leaves are then collected into groups and each group is given a manager. Each group of workers becomes the children of their manager in the tree. These managers in turn are grouped together and given a director. These directors have a common parent, which is the root of the tree. One alternative is to use a "board of directors" instead of a single root.
Under this organization, a worker tries to maintain its load between a high and low watermark. If it gets too much work or too little work, it tells its manager. Its manager will then use this information to try to shift work among its workers to properly balance the load. The managers themselves have quotas. If they find themselves with either a shortage or surplus of cycles, they tell their directors, who in turn try to balance the load among their managers, and so on. If the top level is a committee, instead of a single node, this provides for some level of fault tolerance. If each member of the committee knows everything, it may be possible for a decision to be made, even if one fails.
The goal of this approach is to reduce the amount of information that must be communicated across the network in order to balance the load.
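The watermark idea can be sketched in a few lines. This is only an illustration under assumed thresholds: LOW_WATERMARK, HIGH_WATERMARK, and the function names are hypothetical, and load is measured here simply as the number of queued jobs.

```python
LOW_WATERMARK = 2    # assumed: fewer queued jobs than this means too little work
HIGH_WATERMARK = 6   # assumed: more queued jobs than this means too much work


def worker_report(load):
    """Return what a worker tells its manager, or None if it stays quiet."""
    if load > HIGH_WATERMARK:
        return "surplus"    # too much work
    if load < LOW_WATERMARK:
        return "shortage"   # too little work
    return None


def rebalance(loads):
    """Manager step: shift jobs from overloaded workers to underloaded ones.

    `loads` maps worker name -> queued jobs. Returns (moves, new_loads),
    where moves is a list of (source, destination) pairs, one per job moved.
    """
    loads = dict(loads)
    moves = []
    while True:
        over = [w for w in loads if worker_report(loads[w]) == "surplus"]
        under = [w for w in loads if worker_report(loads[w]) == "shortage"]
        if not over or not under:
            return moves, loads
        src = max(over, key=loads.get)
        dst = min(under, key=loads.get)
        loads[src] -= 1
        loads[dst] += 1
        moves.append((src, dst))
```

If a worker is still above the high watermark after the manager has done what it can locally, that leftover surplus is what the manager would report up to its director.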
Let's Talk About Loads
If workstations are to cooperate and share work, they must somehow, directly or indirectly, communicate their work levels to each other. Obviously, this leads to a trade-off between perfect information and a tolerable level of communication. We need to have a protocol that will give us "good enough" information.
One naive approach is to have processors "yell out" to everyone when they are idle. But this approach has a big problem. If a processor "yells out" to all of the other processors, they might all send work its way. The previously idle processor suddenly is heavily loaded. Then, another processor, perhaps one that recently off-loaded work, becomes idle, and "yells out". Well, that processor gets slammed with work. If a facility for process migration exists, things get even worse. This problem is known as thundering herds.
In order to solve the thundering herds problem, we can use a different form of this receiver initiated technique. Instead of broadcasting to everyone, an idle processor can "ask around". As soon as it finds work, it stops asking. If it doesn't find work after asking some fixed number of hosts, it sleeps, while waiting for more of its own work, and tries again after a dormant period, if it remains idle. This approach leads to heavy communications overhead when the processors are mostly idle.
Another and complementary approach is known as sender initiated processor allocation. Under this approach a host which notices that its queue of waiting jobs is above some threshold level will "ask around" for help. If it can't find help after asking a fixed number of hosts, it assumes that everyone is busy and waits a while before asking again. As was the case with receiver initiated processor allocation, it is good to poll a random collection of hosts to keep things balanced. Unfortunately, this approach leads to heavy communications overhead when the processors are mostly busy.
Hybrid approaches are also possible. These try to balance the costs of the above two approaches. They only "yell out" if they are substantially overworked or underworked. In other words, a processor won't yell out for help, unless it has a really, really long run queue. And it won't advertise that it has cycles available, unless it has been idle for some time. Typically, under hybrid approaches, processors can operate in either sender-initiated or receiver-initiated mode, as necessary.
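As a rough sketch of the "ask around" step, the function below probes a bounded number of randomly chosen hosts and stops at the first one that has work. PROBE_LIMIT, find_work, and the queue_lengths dictionary are hypothetical stand-ins; a real system would send messages rather than read a shared table.

```python
import random

PROBE_LIMIT = 3  # assumed: how many hosts to ask before giving up for now


def find_work(idle_host, queue_lengths, rng=random):
    """Ask up to PROBE_LIMIT randomly chosen hosts for work.

    `queue_lengths` maps host name -> number of waiting jobs.
    Returns the host a job was taken from, or None, in which case the
    caller should sleep for a dormant period and try again if still idle.
    """
    others = [h for h in queue_lengths if h != idle_host]
    for host in rng.sample(others, min(PROBE_LIMIT, len(others))):
        if queue_lengths[host] > 0:
            queue_lengths[host] -= 1   # take one waiting job
            return host                # stop asking as soon as work is found
    return None
```

Because only a few hosts are asked, and the asking stops at the first success, an idle host never attracts work from everyone at once.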
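One way to picture the hybrid policy is as a small decision function with two thresholds. The constants and the name choose_mode are assumptions chosen for illustration only.

```python
HIGH_QUEUE = 8       # assumed: run-queue length that counts as "really long"
IDLE_SECONDS = 30.0  # assumed: how long a host must sit idle before advertising


def choose_mode(queue_length, idle_for):
    """Decide how a host should behave under a hybrid policy."""
    if queue_length >= HIGH_QUEUE:
        return "sender-initiated"    # substantially overworked: look for help
    if queue_length == 0 and idle_for >= IDLE_SECONDS:
        return "receiver-initiated"  # idle for some time: look for work
    return "quiet"                   # otherwise, stay off the network
```

For example, choose_mode(3, 0.0) returns "quiet": a moderately loaded host neither asks for help nor advertises cycles, which is where the savings in communication come from.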
Processor Allocation and IPC
When processes are interconnected with each other via IPC, this becomes a consideration for processor allocation. If the processes are cooperating very heavily and cannot make progress without IPC, it might make sense to run them in parallel. Co-scheduling (Ousterhout '82) is one technique for resolving this. It schedules groups of cooperating processes to run in parallel, by using round-robin style scheduling, and placing the cooperating processes in corresponding time slots on different processors.
Another technique, known as Graph Theoretic Deterministic scheduling, builds a weighted graph of all of the processes. The edges are IPC channels, weighted by the amount of communication. The basic idea is that it is best to have processes that require a great deal of IPC running on the same host, so that the latency associated with the IPC is minimized. These approaches work by partitioning the graph into one subgraph for each processor in such a way as to minimize the total weight of the edges that are cut.
Although the complexity of these approaches makes them poor choices for real-world systems, it is commonplace for humans to keep IPC vs. network traffic in mind when making this type of decision by hand.
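The slot-placement idea can be sketched as a table with one row per time slot and one column per processor. This is a simplified first-fit illustration, not Ousterhout's actual algorithm, and the name coschedule is hypothetical.

```python
def coschedule(groups, num_processors):
    """Assign each group of cooperating processes to a single time slot.

    Returns a list of rows; row t holds the processes that run during time
    slot t, one column per processor (None means that processor is idle).
    """
    slots = []
    for group in groups:
        if len(group) > num_processors:
            raise ValueError("group does not fit in a single time slot")
        # First fit: reuse an existing time slot with enough free processors.
        for row in slots:
            free = [i for i, p in enumerate(row) if p is None]
            if len(free) >= len(group):
                break
        else:
            # No existing slot had room, so open a new time slot.
            row = [None] * num_processors
            slots.append(row)
            free = list(range(num_processors))
        for column, process in zip(free, group):
            row[column] = process
    return slots
```

Each processor then cycles through the rows round-robin, so all members of a group are running during the same time slot and can exchange messages without waiting for a peer to be scheduled.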
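For a handful of processes, the partitioning can be illustrated by brute force. The sketch below is an assumption-laden toy: the function names are hypothetical, and the per-host capacity limit is added here because, without some limit, putting everything on one host trivially cuts no edges.

```python
from itertools import product


def cut_weight(edges, placement):
    """Total weight of IPC edges whose endpoints sit on different hosts."""
    return sum(w for (a, b), w in edges.items() if placement[a] != placement[b])


def best_placement(processes, edges, hosts, capacity):
    """Exhaustively search for the placement with the smallest cut weight.

    Only practical for a few processes; the search space is exponential.
    Returns (cost, placement), or None if no placement fits the capacity.
    """
    best = None
    for assignment in product(hosts, repeat=len(processes)):
        if any(assignment.count(h) > capacity for h in hosts):
            continue  # assumed constraint: at most `capacity` processes per host
        placement = dict(zip(processes, assignment))
        cost = cut_weight(edges, placement)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best
```

With four processes where A and B talk heavily (weight 10) and C and D talk heavily (weight 8), the search keeps each chatty pair together and cuts only the light edges. The exponential search is also a small demonstration of why these approaches are hard to use at scale.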
A Quick Look At Process Behavior and Scheduling
One observation that proves enlightening for us, over and over again, is that "The recent past is a good indicator of the near future." One instance of this is that, as counter-intuitive as it may seem, in general-purpose systems, recently started jobs are likely to be short-lived, whereas long-running jobs are likely to keep running for a very long time.
Why? Well, in general, there are a dramatically larger number of short-lived jobs than long-running ones. Short-lived jobs outnumber long-running jobs by a few miles. But, the relatively few long-running jobs tend to be really long running. In other words, there are a ton of short jobs, but the long-running jobs are really long running. It is an exponential distribution in both planes.
So, if we look at a newly started job, we don't actually know which type it will be: one of the many short-lived jobs or one of the few really long lived jobs. But, with overwhelming likelihood, it is a short-lived job. Having said that, if we look at a job that isn't a short-lived job, as demonstrated by the fact that it has been running for more than a short while -- it is really likely that it is going to be a really long-lived job.
A really, really rough approximation is that we can predict that a job will continue to run as long as it already has, i.e. reflect the past to the future. This is useful, but not statistically quite right -- as we move along the job-length axis, we get fewer and fewer jobs of larger and larger length.
But, in any case, what is the implication for scheduling? Unless we know better, paradoxically, we predict that a job that we just started will end before one that has run for a long time! In other words, if a host has been running the same job for a long time, we predict that it is "Bogged down", rather than "It is almost done."
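The rough rule above is simple enough to write down directly. The function names are hypothetical, and the estimate is only the "reflect the past to the future" approximation, not a fitted statistical model.

```python
def predicted_remaining(age):
    """Rough rule: a job is expected to run about as long again as it already has."""
    return age


def least_bogged_down(hosts):
    """Pick the host whose current job is predicted to finish soonest.

    `hosts` maps host name -> seconds its current job has been running.
    """
    return min(hosts, key=lambda h: predicted_remaining(hosts[h]))
```

Given hosts whose current jobs have been running for 3600, 5, and 120 seconds, this picks the host with the 5-second-old job, which is exactly the paradox: the job that has done the least is the one we expect to finish first.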
Process Migration and Virtual Machines (Or Containers)
If you've got virtual machines (or containers) and a distributed file system, you've solved some problems. The VM can migrate, moving the file sessions and internal state with it. But, depending on the VM configuration, it might or might not be able to keep its old IP address.
To keep its old IP address, it has to be a public IP address assigned to it, rather than a situation where it gets a private IP address within a VPN and publicly uses the public IP of the host. It also has to migrate within the same network, so broadcasts to its MAC address still get there. And, lastly, switches and the like need to be made aware of the move, so they don't misunderstand where it is in a bridged link layer and deliver its messages to the wrong leg.
VMs really can help to make migration almost transparent, especially if there exists a DFS that won't become confused by the move and the file interaction is structured as open-read/write-close transactions. But, that brings us to the hard part: the external (network) interaction. And, VMs really can't fix this.
What's The Answer?
To the extent that we can use virtual machines to abstract away the problem, that is fantastic. They might or might not be an option for reasons of efficiency, compatibility, etc. But, to the extent that they are an answer, they are often a nice, clean one.
Beyond that, we've got to hide the migration within lower-level primitives. For example, we don't want our code to be filled with hacks to handle migrating TCP connections. Instead, we want to build a recoverable, migratable communications layer: one that can suspend, update, and resume. This way, our program logic is clean -- and our communications layer can hide the dirt.
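As a sketch of what "suspend, update, and resume" might look like from the program's side, consider the toy channel below. MigratableChannel and its transport callable are hypothetical; a real layer would also have to deal with in-flight data, acknowledgements, and the peer's view of the connection.

```python
from collections import deque


class MigratableChannel:
    """Hypothetical message channel that survives a move to a new endpoint."""

    def __init__(self, endpoint, transport):
        self.endpoint = endpoint     # current address of the peer
        self.transport = transport   # callable(endpoint, message) that sends
        self.suspended = False
        self.pending = deque()       # messages held back during migration

    def send(self, message):
        if self.suspended:
            self.pending.append(message)   # buffer instead of failing
        else:
            self.transport(self.endpoint, message)

    def suspend(self):
        self.suspended = True

    def update(self, new_endpoint):
        self.endpoint = new_endpoint

    def resume(self):
        self.suspended = False
        while self.pending:                # flush in the original order
            self.transport(self.endpoint, self.pending.popleft())
```

The program logic only ever calls send(). Whatever drives the migration calls suspend(), update(), and resume(), so messages sent during the move are delivered, in order, to the new endpoint.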
HTCondor and TORQUE
There are good tools available to help schedule jobs across workstations, monitor them, and checkpoint their progress. HTCondor and OpenPBS are examples. They are capable of monitoring the load on a pool of systems and distributing jobs to them as they are able. HTCondor can be used to scavenge idle cycles from workstations, or for dedicated systems. Unlike other tools, such as OpenMP, MPI, and PVM, they aren't designed to solve the problem of parallelizing or distributing a single large task. Instead, they are designed to allow for the distribution of many tasks over a large number of computers. I've used all three, for example, to distribute animation and rendering jobs across dedicated high performance clusters, and to scavenge cycles from public workstations during low-use times.
HTCondor can schedule precompiled programs, or it can provide improved services, such as checkpointing and migratable I/O, for those programs compiled with its libraries. It can also support parallel processing when combined with MPI or PVM (see above).
Much like HTCondor, TORQUE provides a solution for the distributed scheduling of batch jobs. It also provides fault tolerance in the event of failed nodes,