Supporting a Virtual Address Space Larger Than Physical Memory

Supporting a virtual address space larger than the physical address space requires what is called a backing store. A backing store is storage large enough to hold all of the data that could fit into the virtual address space. The backing store is usually a secondary storage device such as a disk.

When we consider memory from this perspective, we find that smaller, faster, and more expensive storage devices are actually being used as a cache for larger, slower, and less expensive devices. This model for a system's memory is called the memory hierarchy.

Overlays

The oldest form of virtual memory is called the overlay. With this technique, we can think of a program's memory use as involving several separate, self-contained parts. These parts, called overlays, can be swapped into and out of main memory one at a time. This swapping is handled by a small amount of "glue" code called the overlay driver. Programming with overlays requires careful design from the programmer as well as special linking support from the compiler. Recovering Turbo Pascal programmers might remember using overlays to "break the 640K barrier".

Overlays are no longer used in general-purpose systems. This is because of the high maintenance costs associated with code changes (changes that alter the size of the code) and the development of more sophisticated virtual memory techniques.

Swapping

Swapping is another technique; it replaces the entire address space each time a task context-switches. This technique is transparent to the user, but the I/O cost is tremendous.

The goal should be to preserve the transparency, but to improve the efficiency.

Demand Paging

Demand Paging is like simple paging and swapping all rolled up into one. We consider the page to be the unit of I/O. Instead of swapping all of the pages at once when a context-switch occurs, we defer loading or storing any page until it becomes absolutely necessary.

The idea is this. If we attempt to access a virtual address that lies within a page in memory, we win. If not, this is a page fault. When a page fault occurs, we load the page into main memory from the backing store. If there isn't enough free main memory to load the page, a page already in memory must first be evicted (written to the backing store).

Demand paging requires several types of hardware support: a page table with a resident (valid) bit in each entry, a trap into the operating system when a non-resident page is referenced (the page fault), and the ability to restart the faulting instruction once the page has been brought in.

Steps in Handling a Page Fault

  1. Select the victim frame
  2. Write the victim to the disk, if it is dirty ("page out")
  3. Read the new page from disk ("page in")
  4. Clean up: set up the page table entries, flush/fix the TLB
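
To make the steps concrete, here is a toy sketch in Python. NFRAMES, pick_victim(), and the dicts standing in for the frame table and backing store are all made-up names for illustration, not any real kernel's interfaces:

    # A toy model of the page-fault path (simulation only).

    NFRAMES = 3
    frames = {}          # frame number -> (page number, dirty?)
    page_table = {}      # page number  -> frame number (resident pages only)
    backing_store = {}   # page number  -> saved contents

    def pick_victim():
        return next(iter(frames))        # placeholder replacement policy

    def handle_page_fault(page):
        if len(frames) == NFRAMES:                       # no free frame left
            victim_frame = pick_victim()                 # 1. select the victim
            victim_page, dirty = frames.pop(victim_frame)
            if dirty:
                backing_store[victim_page] = "saved"     # 2. page out (only if dirty)
            del page_table[victim_page]
        else:
            victim_frame = len(frames)                   # a free frame still exists
        contents = backing_store.get(page, "zero fill")  # 3. page in from the backing store
        frames[victim_frame] = (page, False)
        page_table[page] = victim_frame                  # 4. clean up the page table
        # ...a real system would also flush or fix the TLB entry here

    def access(page, write=False):
        if page not in page_table:       # not resident: this reference faults
            handle_page_fault(page)
        if write:
            frames[page_table[page]] = (page, True)   # set the dirty (modified) bit

    for ref in ["A", "B", "C", "A", "D"]:   # "D" forces a replacement
        access(ref)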

How Do We Select The Victim?

There are two philosophies that govern which frames are fair game: a local policy considers only frames belonging to the faulting process, while a global policy considers every frame in the system, including frames currently holding other processes' pages.

Regardless of the bounds on our selection, we need to decide among the available frames. This decision process is called the Replacement Algorithm.

OPT -- The Optimal Replacement Algorithm

The ideal replacement algorithm is Optimal (OPT). OPT looks ahead into the future and selects the page that will go unused for the longest time -- this page is selected for replacement. Unfortunately, clairvoyance isn't available to most systems. It would require not only an intimate knowledge of the program itself, but also knowledge of every piece of data, user input, exception, &c. If the course of a process's execution could be known in advance, we wouldn't need to execute the program (the ultimate in branch prediction). OPT is a good reference for comparison -- but not a real-world solution.
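
Although OPT can't be run online, it can be computed after the fact from a recorded reference string, which is exactly how it gets used as a yardstick. A sketch (the trace and frame count below are made up for illustration):

    def opt_faults(trace, nframes):
        """Count page faults for the optimal (OPT) policy on a known trace."""
        frames, faults = set(), 0
        for i, page in enumerate(trace):
            if page in frames:
                continue
            faults += 1
            if len(frames) == nframes:
                # Evict the resident page whose next use is farthest in the
                # future (or that is never used again).
                def next_use(p):
                    try:
                        return trace.index(p, i + 1)
                    except ValueError:
                        return float("inf")
                frames.remove(max(frames, key=next_use))
            frames.add(page)
        return faults

    print(opt_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))   # 7 faults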

LRU -- Least Recently Used

Okay, so OPT is impossible, so what can we do? One feasible approximation of OPT is Least Recently Used (LRU). LRU is based on the observation that the recent past is a good predictor of the near future. As a consequence, it selects the page that was accessed least recently as the victim.

While this is generally a good approach, it isn't perfect. For some tasks -- garbage collection, for example -- the recent past couldn't be a worse predictor of the future. Once a garbage collector accesses a page, it won't return to it in the same pass through memory.

How do we know LRU's choice?

We keep a list of all frames. Every time a frame is accessed, we move it to the head of the list. Thus the head of the list becomes the "hot end" and the tail of the list becomes the "cold end." When a page needs to be replaced, we select the victim to be the last page on the list -- it is the least recently used.
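
In software, that bookkeeping is only a few lines -- a sketch, not something an MMU could afford to do on every reference:

    from collections import OrderedDict

    class LRU:
        """Exact LRU via a recency-ordered list of resident pages."""
        def __init__(self, nframes):
            self.nframes = nframes
            self.pages = OrderedDict()         # most recently used at the end

        def access(self, page):
            """Return True if this reference faulted."""
            if page in self.pages:
                self.pages.move_to_end(page)   # move to the "hot end"
                return False
            if len(self.pages) == self.nframes:
                self.pages.popitem(last=False) # evict the "cold end": the LRU victim
            self.pages[page] = True
            return True

    lru = LRU(3)
    print(sum(lru.access(p) for p in [1, 2, 3, 1, 4, 5]))   # faults on a toy trace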

Reality Strikes Again: Approximating LRU

While linked lists are the workhorses of software, manipulating pointers on each memory reference is far too expensive for hardware. So in practice we approximate LRU. Many approximations exist. The goal is for the OS to do most of the work, while the hardware provides limited support.

The basic idea for most approximations is to maintain approximate usage information and to manipulate the list periodically, not constantly.

LRU Approximation: Simple Buckets

Many processors provide a reference bit as part of the hardware supported page table structure. The hardware sets this bit each time the page is accessed.

One simple approach is to look at this bit only periodically. Initially, all of the bits are 0. The bits are then set as pages are accessed. At the end of the interval, the OS can reorganize the list into categories of pages (buckets): referenced this interval, not referenced for one interval, not referenced for two intervals, &c. At each pass, if the bit is not set, the page moves one category back. If it was referenced, it moves to the top category for pages referenced during the current interval. As it reorganizes the list, the OS clears the bits.
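
A sketch of that periodic scan, with touch() standing in for the hardware setting the reference bit and made-up page names:

    # buckets[0] holds pages referenced this interval, buckets[1] pages idle
    # for one interval, and so on (simulation only).

    NBUCKETS = 4
    buckets = [set() for _ in range(NBUCKETS)]
    ref_bit = {}

    def page_in(page):
        buckets[0].add(page)          # a newly loaded page counts as referenced
        ref_bit[page] = True

    def touch(page):
        ref_bit[page] = True          # models the hardware-set reference bit

    def end_of_interval_scan():
        """Run by the OS once per interval: age the pages and clear the bits."""
        old, new = buckets[:], [set() for _ in range(NBUCKETS)]
        for age, pages in enumerate(old):
            for page in pages:
                if ref_bit[page]:
                    new[0].add(page)                           # referenced this interval
                else:
                    new[min(age + 1, NBUCKETS - 1)].add(page)  # move back one bucket
                ref_bit[page] = False                          # clear the bit
        buckets[:] = new

    def pick_victim():
        """Take a victim from the coldest non-empty bucket."""
        for pages in reversed(buckets):
            if pages:
                return pages.pop()

    for p in ["A", "B", "C"]:
        page_in(p)
    end_of_interval_scan()       # everything referenced so far: all in bucket 0
    touch("B")
    end_of_interval_scan()       # only B stays in bucket 0; A and C age one bucket
    print(pick_victim())         # a page from the coldest bucket ("A" or "C")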

The Clock Algorithm

Another way of using this bit is known as the Clock Algorithm. The clock algorithm maintains a circular list of pages and a pointer into this list. When a victim is needed, the pointer advances until it finds a page whose reference bit is clear; each time it passes over a page whose bit is set, it clears that bit. This ensures that a victim will eventually be found. In the worst case, a victim will be found at the beginning of the second pass through the list -- that page's bit was cleared when the page was passed during the first pass.
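
A sketch of the clock hand over a fixed set of resident pages, with touch() standing in for the hardware setting the reference bit:

    class Clock:
        """Clock (second-chance) victim selection over a fixed set of frames."""
        def __init__(self, pages):
            self.pages = list(pages)              # circular list of resident pages
            self.ref = {p: False for p in pages}
            self.hand = 0

        def touch(self, page):
            self.ref[page] = True                 # models the hardware-set bit

        def pick_victim(self):
            while True:
                page = self.pages[self.hand]
                if self.ref[page]:
                    self.ref[page] = False        # give it a second chance
                    self.hand = (self.hand + 1) % len(self.pages)
                else:
                    return page                   # bit already clear: the victim

    clock = Clock(["A", "B", "C"])
    clock.touch("A"); clock.touch("C")
    print(clock.pick_victim())    # "B": the only page with a clear reference bit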

The Modified Bit

In addition to the reference bit, some hardware provides a modified bit. The modified bit is set if the contents of a page are changed. It makes sense to prefer a "clean" (unmodified) victim to a "dirty" one. This is because replacing a clean victim requires only a page-in of the new page, whereas replacing a dirty victim requires a page-out as well as the page-in. Almost any algorithm can be modified to prefer a clean page -- although it might take extra work.
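
For example, one common way to fold the modified bit in is to rank candidate victims by their (referenced, modified) bits, preferring pages that are neither. A sketch (the page names and bit values are made up):

    def best_victim(candidates):
        """candidates: list of (page, referenced, modified) tuples.
        Prefer unreferenced over referenced, and clean over dirty, so an
        unreferenced clean page is the best victim of all."""
        return min(candidates, key=lambda c: (c[1], c[2]))[0]

    pages = [("A", True, True), ("B", True, False), ("C", False, True)]
    print(best_victim(pages))    # "C": not recently referenced, even though dirty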

Prepaging and Victim Pools

The performance of the VM system can often be improved by prepaging. If, for example, the system observes that pages are being read in order, it might be wise to stay "a step ahead" and keep a few pages ahead of the requests. Oftentimes, after a certain number of sequential requests, operating systems will begin to prepage. By bringing pages in ahead of the request, the OS can prevent the process from blocking to wait for the page. Of course, unnecessary prepaging can force useful pages out of memory, reducing the performance of the system.

Another common technique is to take advantage of lulls in I/O and write out dirty pages in advance of a request. For example, it might make sense to keep a pool of "victims" clean in order to ensure that the request for a page can be satisfied with only a page-in and without a page-out. This can, in effect, cut the penalty in half.

Of course, if constantly changing pages are selected as "victims", it will do little good -- the pages will be dirty again almost as soon as they are written. In that case, the policy has done little more than create work for the I/O system. As a result, "victim" selection here follows the same guidelines as on-demand victim selection. The only difference is that these "victims" are still in memory and can be rescued if they are used before they are overwritten.

Approximating LRU Without Hardware Support

If the hardware does not support the reference (R) bit, it can be emulated. Initially, all pages are marked for no read access and no write access. When an access is attempted, a page fault will occur. When the trap occurs, the OS can set the bit in the page table and reset the access mode on the page. If pages cannot be held with no access rights, we can instead mark the page not present so that it faults the first time it is accessed, and use the address translation mechanism to mark the R bit in software. In effect, this is using the protection or address translation mechanism as a "tripwire."

The TLB can be used in much the same way:

  1. Every time interval, examine the TLB entries. If a page is in the TLB, place it into bucket 0 (accessed this interval); otherwise, move it back one bucket in the list.
  2. Flush the TLB.
  3. As pages are accessed, they are inserted back into the TLB, which marks them as referenced for the next scan.
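
Either flavor of tripwire can be modeled in a few lines. The sketch below simulates the protection-based version: entries start with no access, the first touch "faults" into the OS, which records the reference in a software R bit, and the end-of-interval scan re-arms the tripwire (all names are made up):

    pte = {}          # page -> {"access": bool, "r": bool}   (simulated page table)

    def map_page(page):
        pte[page] = {"access": False, "r": False}   # trip on first access

    def access(page):
        if not pte[page]["access"]:       # tripwire: "fault" into the OS
            pte[page]["r"] = True         # record the reference in software
            pte[page]["access"] = True    # restore access so later touches are free
        # ...normal memory access proceeds here...

    def end_of_interval():
        """Harvest and reset the software R bits, re-arming the tripwire."""
        referenced = [p for p, e in pte.items() if e["r"]]
        for e in pte.values():
            e["r"] = False
            e["access"] = False           # next access will fault again
        return referenced

    for p in ["A", "B", "C"]:
        map_page(p)
    access("A"); access("C"); access("A")
    print(end_of_interval())              # pages referenced this interval: A and C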

FIFO

Another approach is to approximate LRU using a FIFO queue. Basically, we replace pages in the order in which we brought them in -- independent of how often or how recently they were used.

This approach is appealing because it can be implemented without hardware support and without any black magic. In fact, it is reasonably easy to implement in software. But its performance is worse than the better approximations of LRU (which are in turn worse than LRU itself, which is worse than OPT).
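
A sketch of FIFO replacement, kept entirely in software with an ordinary queue (the trace is made up):

    from collections import deque

    def fifo_faults(trace, nframes):
        """Count page faults under FIFO replacement."""
        queue, resident, faults = deque(), set(), 0
        for page in trace:
            if page in resident:
                continue                   # hit: FIFO order is NOT updated on use
            faults += 1
            if len(resident) == nframes:
                resident.remove(queue.popleft())   # evict the oldest arrival
            queue.append(page)
            resident.add(page)
        return faults

    print(fifo_faults([1, 2, 3, 1, 4, 5], 3))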

Belady's Anomaly

FIFO also suffers from an ailment known as Belady's Anomaly. Although this may sound counterintuitive, adding more physical frames and decreasing frame pressure can actually lead to a higher page fault rate! It has been shown that any algorithm that does not have the stack property can suffer from this surprising phenomenon.

The stack property is so named because it can best be observed when using the stack model for implementing LRU. The important aspect of the stack property is that every page which is in memory when N frames are available is also in memory when (N+1) frames are available. In other words, if we take the trace of any program and examine memory when it is executed with N frames and again with (N+1) frames, at every corresponding point in the trace the pages that were in memory during the first run are also in memory during the second run. If you play with FIFO, you will find that this is not necessarily the case.
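
The anomaly is easy to reproduce. The classic reference string below faults 9 times under FIFO with three frames but 10 times with four (the small FIFO counter from the sketch above is repeated here so this stands alone):

    from collections import deque

    def fifo_faults(trace, nframes):
        queue, resident, faults = deque(), set(), 0
        for page in trace:
            if page not in resident:
                faults += 1
                if len(resident) == nframes:
                    resident.remove(queue.popleft())
                queue.append(page)
                resident.add(page)
        return faults

    belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print(fifo_faults(belady, 3))   # 9 faults with three frames...
    print(fifo_faults(belady, 4))   # ...but 10 faults with four: Belady's Anomaly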

Performance of Demand Paging

The performance of demand paging is often measured in terms of the effective access time. Effective access time is the amount of time it takes to access memory, if the cost of page faults is amortized over all memory accesses. In some sense it is an average or expected access time.

ea = (1 - p) * ma + p * pft

ea  = effective access time
ma  = physical memory (core) access time
pft = page fault time
p   = probability of a page fault occurring
(1 - p) = probability of accessing memory in an available frame

The page fault time is the sum of the additional overhead associated with accessing a page in the backing store. This includes additional context switches, disk latency and transfer time associated with page-in and page-out operations, the overhead of executing an operating system trap, &c.
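
As a quick worked example (the numbers are assumptions, not measurements): with a 100 ns memory access time and an 8 ms page fault time, even a fault rate of one in a thousand references makes memory look roughly eighty times slower.

    ma  = 100          # memory access time in ns (assumed)
    pft = 8_000_000    # page fault time: 8 ms, in ns (assumed)

    for p in (1 / 1_000, 1 / 10_000, 1 / 1_000_000):
        ea = (1 - p) * ma + p * pft
        print(f"p = {p:.6f}   ea = {ea:10.1f} ns   (~{ea / ma:.0f}x memory access time)")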

Thrashing

A system is said to be thrashing if it spends more time handling page faults than it does performing useful work.

Too much multiprogramming is the cause of thrashing. When too many processes run, there are not enough physical frames available for each process to meet its immediate demand. As a result, the system thrashes.

This phenomenon can be caused by poor long-term scheduling. Consider a long-term scheduler that seeks to increase CPU utilization by increasing the level of multiprogramming when the CPU is underutilized. Now consider this system when most processes are blocked waiting for their pages to be loaded. Since CPU utilization is low, the long-term scheduler will admit more processes. This will increase the demand for frames, and as a consequence more resources will be squandered by thrashing. Since more processes are blocked, CPU utilization appears low, so the long-term scheduler will admit still more processes, and the vicious cycle repeats. The end result is a drastically higher memory access time.

Stealing Frames?

When implementing demand paging, we can consider physical frames to be either a per-process or a system-wide resource. If we consider frames to be a per-process resource, each process gets some number of physical frames. When the process must page in a new page, one of its own pages must be written out. When frames are considered to be a system-wide resource, less-active frames from other processes may be selected for the page-out.

Using a system-wide policy is generally more efficient (it reduces access time), because it considers more frames in selecting those least likely to be needed again soon. But this policy fails to control thrashing. If frames are considered to be a per-process resource, it is impossible for the level of multiprogramming to reduce the number of frames allocated to a running process.

Program Behavior

One key factor in determining a program's performance is the number of page faults it incurs. The number of page faults is a function of several factors: the program's locality of reference, the number of frames allocated to it, the page size, and the replacement algorithm.

Given demand paging, a program's behavior can be characterized by its working set. A small working set is synonymous with good behavior. For a given memory size, a larger working set will yield more page faults.

      Working Set (T) = the number of distinct pages referenced by a task in time T.

With a working set large enough relative to the available memory, the system will do nothing but handle page faults. This is called thrashing. There are only two solutions: give the process more frames, or reduce the demand for frames (for example, by suspending processes to lower the level of multiprogramming).


A small working set implies high temporal locality and high spatial locality. Spatial locality is a measure of how close to each other memory accesses are. High spatial locality means that if an address is accessed, another nearby address will be accessed very soon. High spatial locality is often associated with accesses to code, arrays, &c. Temporal locality is a measure of the amount of time between accesses to the same page. High temporal locality implies that if a page has been accessed, the same page will be accessed again very soon. High temporal locality is often associated with loops or recursion.

Another way to think about a working set is this:

We can consider the size of the working set to be the number of distinct pages a process references within the last delta memory references. We call delta the working set window. If a process accesses many pages, it has a large working set, indicating poor locality. If a process accesses very few pages in the same number of references, we say that it has a small working set, indicating strong locality.

The trick to this type of analysis is that delta must be set appropriately. If delta is too big, we will consider too many memory references and our analysis will cross several localities -- parts of the process that use different parts of memory. But if delta is too small, we won't capture all of the pages needed in the locality.
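
As a sketch, sliding a window of the last delta references across a recorded trace and counting distinct pages gives the working set size over time (the trace below is made up, with two obvious localities):

    def working_set_sizes(trace, delta):
        """Distinct pages referenced within each window of the last delta references."""
        return [len(set(trace[max(0, i - delta + 1): i + 1]))
                for i in range(len(trace))]

    trace = [1, 2, 1, 2, 1, 2, 7, 8, 9, 7, 8, 9]   # two localities
    print(working_set_sizes(trace, 4))
    # small, stable sizes within each locality; a bump where the locality shifts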

Page Demand

Under the working set model, a system's demand for pages equals the total number of pages in the working sets of all processes. The system's level of multiprogramming can be increased as long as this demand can be met. If there are fewer frames available than are demanded, processes should be suspended to free allocated frames and prevent thrashing.
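
In sketch form, the admission decision is just a comparison between total demand and the supply of frames (the process names and working set sizes below are made up):

    def frames_demanded(wss):
        """Total demand D = sum of the per-process working set sizes."""
        return sum(wss.values())

    total_frames = 64                                     # physical frames (assumed)
    wss = {"editor": 10, "compiler": 30, "daemon": 5}     # made-up working sets

    if frames_demanded(wss) <= total_frames:
        print("demand can be met; it is safe to admit another process")
    else:
        print("demand exceeds supply; suspend a process to avoid thrashing")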

Prepaging is the policy of remembering a process's working set when it is suspended or interrupted and then restoring these pages to physical memory when the process is dispatched. The effectiveness of prepaging depends on the overhead and locality -- it can be a win or lose proposition.

Page Fault Frequency

Another approach to balancing the level of multiprogramming is to directly monitor the frequency of page faults. If there are too few page faults, the level of multiprogramming should be increased. If there are too many, processes should be suspended.
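
A sketch of what that control loop might look like for a single process; the threshold values are arbitrary assumptions, not measured numbers:

    UPPER = 0.05     # faults per reference above which the process needs more frames (assumed)
    LOWER = 0.005    # rate below which frames can safely be taken away (assumed)

    def pff_adjust(fault_rate, free_frames):
        """Crude page-fault-frequency decision for one process."""
        if fault_rate > UPPER:
            if free_frames > 0:
                return "allocate another frame to this process"
            return "no free frames: suspend a process to relieve the pressure"
        if fault_rate < LOWER:
            return "reclaim a frame from this process"
        return "fault rate acceptable: leave the allocation alone"

    print(pff_adjust(0.08, free_frames=0))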