### Problem 1

1. Please read the section on checkpointing and recovery in Rosenblum's LFS paper that was passed out in class.
In Sprite LFS, segments are written from back to front. For example, below is a diagram showing a segment that has had one write performed on it.

After one more write, the segment could look like this:

A third write would fill the segment (in this diagram, it would write two data/inode blocks and one summary block) and then start filling a new segment.
1. Consider the roll-forward procedure in crash recovery for Sprite LFS. How does the roll-forward procedure determine when to stop rolling forward? If you can't determine this from the paper, then make an educated guess and justify your guess.

2. The segment summary block points to where the next segment summary block will be written. Either that block does not contain a segment summary block (the format would not match) or it does contain a segment summary block. If the latter is true, roll-forward must be able to determine whether it was written before or after the last checkpoint. A simple timestamp can be used to determine this.

3. Sprite LFS wrote the segments as outlined above because it allowed the cleaner and the roll-forward procedure to make the assumption that if a segment summary block was safely written to disk (such as the last block in the segment shown above), then all of the blocks described by the segment were also safely written to disk. Unfortunately, this is based on the assumption that a disk controller writes blocks in the sequence they are stored in memory. In order to reduce rotational latency, some controllers start the write transfer depending on the current rotational position of the head. For example, in the first write shown above, the controller could first write out the last three green blocks, and then the segment summary block, and then finally the first three green blocks.
What impact does this have on recovery? How would you change the way that LFS writes data to disk (the order, the contents, or both) in order to accommodate such disk controllers.

It has a very serious impact on recovery - roll-forward cannot determine whether all of the segment pages were written or not. If they were not, then some of the disk pages will contain garbage. A simple technique to detect such cases is to include in the segment summary block a checksum of the pages in the segment. If the checksum doesn't match, then some pages were not written out.

### Problem 2

Consider the following code that inserts a node into a doubly linked list (using fields prev and next) sorted on the field key. It is a non-unreasonable approximation to what Rio might use to maintain the list of buffers that the recovery code will reference upon rebooting.
void InsertNode (Node* first, last, new) {
Node *at = first;
while (at != NULL && at->key > new->key) at = at->next;
if (at == NULL) {          // Insert as last node in list
new->prev = last;
new->next = NULL;
if (last == NULL) first = new;
else last->next = new;
last = new;
}
else {                     // Insert before at
new->prev = at->prev;
new->next = at;
if (at->prev == NULL) first = new;
else at->prev->next = new;
at->prev = new;
}
}
1. Consider using the Rio cache approach: this routine is used to maintain a data structure that is read after the machine recovers from a crash. How could a machine failure leave this doubly linked-list in an unsafe condition? The forward links and the backward links my not agree. For example, when inserting a node before another one, the prev pointer is updated after the next pointer. Thus, a poorly-timed crash could make it possible for a routine that walks the list backwards to see a different list than one that walks the list forward.

2. Suppose that Rio indeed used such a doubly-linked list. How would you design Rio to deal with the problem? (Note that you have two general choices to consider: either doing something so that the datastructure is not seen in an unsafe condition or doing something to make it consistent). A simple recovery strategy would be, upon recovery, to enumerate the list starting from the value of first. Once the list has been enumerated, the back pointers could be set to the correct value. Invoking the routine below as CorrectList(first, &last) will do this:

void CorrectList (Node * first, Node ** last) {
Node * at = NULL;
*last = first;
while (first != null) {
first->last = at;
at = first;
first = first->next;
}
}

3. The Rio paper does not discuss the impact of either the L1 or L2 cache. The existence of such caches complicate building Rio. Explain why, and how one could change Rio to accomodate such hardware caches. There's no telling when a value from cache will be flushed to memory. Cache is lost when a crash occurs, and so Rio can only count on what is on disk and in physical memory to recover.

Thus, Rio would need to implement some kind of "flush" mechanism to ensure that cache has made it to disk. In practice, this is often done by writing enough memory (and thus loading the caches) to ensure prior writes are complete. Then, since there is no determinable order as to when cache makes it to memory, Rio would need to implement some kind of transactional abstraction. For example, considering the linked list described in this question, Rio could maintain two lists, say list A and list B. There would also be a memory location x whose value would be either A or B: it would have an initial value of A. To update the list, Rio first adds it to the list not indicated by x and flushes the contents of cache. It then sets x to indicate the other list and flushes cache again. Finally, Rio updates updates the list that was not updated. During recovery, Rio reads the list indicated by the value x.

### Problem 3

Design and write a C program that will allow you to determine the size of the file cache in an operating system, assuming (reasonably) that the cache is finite and that it uses an LRU replacement policy. Use your program to see how large the file cache is for some machine. Your answer should include a description of the ideas behind your program, and your analysis that led to your measurement of the file cache size.

Note that if decide to measure the file cache size of ieng9 you are most likely going to thrash the file cach, and you'll be competing with other users for the use of file cache. You need to run your program either on your own personal computer or on one of the individual workstations in EBU2 313, EBU1 3327, EBU3329, or APM B402 so that you are the only user of the computer. The room EBU2 313 usually has quote low traffic, and so you might try those first.

If you use one of the individual workstations, then you may fiind the file bigfile in public/hw2/ useful. This is a 1G file that contains a random number as an ASCII characters string (nonzero terminated).

If you wish your program to measures times, then you might want to use the routine gethrtime. Use man gethrtime for details.

Hint: If you use one of the ieng9 workstations, then there will be two caches: one associated with your workstation and one associated with the NFS server. Be careful to report the size of the first cache (although, if you wish, you could measure the size of each. If you do this, then try to choose a time when that server should be lightly loaded, since otherwise you'll be competiting with other users for the NFS server's cache).

Here's my program. The idea is to read the first p blocks of the file pass times, a block at a time, and measure the average time to read a block. Then, measure the time it takes to read the first block. Starting with a small value of p and running the program twice, the user will see the time needed to read a block when it is in cache (it may not be cached the first time, but it will be the second time). Then, the user can increase p until the time it takes to re-read the first block becomes large. This means that the cache was not large enough to hold p blocks.
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define bufSize 8192
#define passes 5

char buf[bufSize];

int main (int argc, char **argv) {
int count, pass;
int nblocks;
int file;
int rstatus;
hrtime_t rt_elapsed[passes+1];
if (argc != 3) {
printf("Need file name and number of MB to read.\n");
exit(1);
}
nblocks = atoi(argv[2]);
if (nblocks <= 0) {
printf("%s is not an integer that is at least 1.\n", argv[2]);
exit(1);
}
nblocks = nblocks * 256;
file = open(argv[1], O_RDONLY);
if (file < 0) {
printf("Could not open file %s.\n", argv[1]);
exit(1);
}
printf("Reading first %s MB of %s.\n", argv[2], argv[1]);
pass = passes;
rt_elapsed[0] = gethrtime();
for (pass = 0; pass < passes; pass++) {
count = nblocks;
while (count-- > 0) {
if (rstatus < bufSize) {
printf("Read returned with status %d", rstatus);
exit(1);
}
}
rt_elapsed[pass+1] =gethrtime();
lseek(file, 0, SEEK_SET);
}
for (pass = passes; pass > 0; pass--)
rt_elapsed[pass] = rt_elapsed[pass] - rt_elapsed[pass-1];
for (pass = 1; pass <= passes; pass++)
printf("The average time to read a block in pass %d was %lld usec.\n",
pass, rt_elapsed[pass]/(1000*nblocks));
close(file);
exit(0);
}

last edited on 11 June 2002 by kam