At the system-call level, most systems today implement something similar to the semantics of POSIX (Unix):
We'd like our files to persist after an operating system reboot, so we put them somewhere safe: most commonly today, we use a hard drive. While most of a computer is purely electronic, a hard drive is also mechanical; this makes it orders of magnitude slower in performing some operations. The important characteristics of a hard drive for our purposes are:
Most Unix filesystems, and many non-Unix filesystems too, have a fairly simple structure, made up of several key elements:
Question: Given these data structures, what disk accesses would be required when executing the following code, assuming no data is in cache?
char buf[32];
int fd = open("/usr/include/stdio.h", O_RDONLY);
read(fd, buf, 32);
close(fd);
Because seeks on disks are slow, we often want to batch up disk operations as much as possible, so we can reorder or combine them for better efficiency. When performing disk writes, one good way to do this is to use a write-back cache (as opposed to a write-through cache). As we modify files, we don't write modified blocks to the disk immediately. Instead, we keep the data in memory, in a cache, and write it out to disk somewhat lazily (but usually within some window of time, such as 30 seconds, to ensure that data does not stay in memory for too long).
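To make the write-back idea concrete, here is a minimal sketch in C. It is a toy model under stated assumptions, not how any real kernel implements its cache: the "disk" is just an array, and the names cache_write, cache_flush, NBLOCKS and BLOCK_SIZE are invented for illustration. It shows two effects described above: writes reach the disk only at flush time, and several writes to the same block are combined into one disk write.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS    8    /* size of the toy "disk", in blocks (assumed) */
#define BLOCK_SIZE 16   /* bytes per block (assumed) */

static char disk[NBLOCKS][BLOCK_SIZE];   /* stands in for the hard drive */
static char cache[NBLOCKS][BLOCK_SIZE];  /* in-memory copies of blocks */
static int  dirty[NBLOCKS];              /* 1 if cache[i] is newer than disk[i] */
static int  disk_writes;                 /* total blocks actually written to disk */

/* Modify a block in memory only; the disk is not touched yet. */
void cache_write(int blockno, const char *data)
{
    assert(blockno >= 0 && blockno < NBLOCKS);
    strncpy(cache[blockno], data, BLOCK_SIZE - 1);
    cache[blockno][BLOCK_SIZE - 1] = '\0';
    dirty[blockno] = 1;
}

/* Write every dirty block to disk in ascending block order, so that
 * repeated writes to one block cost a single disk write and the blocks
 * go out in one sweep. Returns the number of blocks written. */
int cache_flush(void)
{
    int written = 0;
    for (int i = 0; i < NBLOCKS; i++) {
        if (dirty[i]) {
            memcpy(disk[i], cache[i], BLOCK_SIZE);
            dirty[i] = 0;
            disk_writes++;
            written++;
        }
    }
    return written;
}
```

For example, calling cache_write on blocks 5, 2, and 5 again leaves the disk untouched until cache_flush runs, which then writes only two blocks. A real cache would call something like cache_flush from a timer, which is where the 30-second window comes from.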
The order in which we write out blocks to disk may be quite different from the order in which the changes were made. This opens up the possibility of all sorts of problems if the computer crashes while some data has not yet been written. When the computer reboots and reads the filesystem, it may find that the filesystem is in an inconsistent state.
Question: What types of filesystem problems can occur if the system crashes before all filesystem modifications have been written? We usually assume that the disk is capable of ensuring that complete blocks get written, so we don't have a problem with part of a block being written.
We'll divide bytes on disk into two categories: data and metadata. Metadata is data about data, such as inodes, directories, and indirect blocks; in short, everything that isn't user data but is used to locate user data on disk. After a system crash, there may have been writes to both data and metadata that did not finish. However, from a filesystem point of view, we're generally most concerned with inconsistencies in filesystem metadata. (Why?)
Often after a crash, a filesystem checker (fsck on Unix) will run, analyzing disk contents and trying to recover from any partial writes of data, at least to the point of making the filesystem consistent again.
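As a rough illustration of the kind of check a filesystem checker performs, the sketch below rebuilds a block allocation map from the inodes and repairs any disagreement. Everything here is a simplifying assumption: the struct layouts, the sizes, and the name toy_fsck are hypothetical, and a real fsck performs many more checks than this one.

```c
#define NBLOCKS 8   /* blocks on the toy disk (assumed) */
#define NINODES 4   /* inodes on the toy disk (assumed) */
#define NDIRECT 2   /* block pointers per inode (assumed) */

/* Hypothetical on-disk metadata, heavily simplified. */
struct inode {
    int in_use;             /* 1 if this inode describes a file */
    int blocks[NDIRECT];    /* block numbers, or -1 for "no block" */
};

struct toy_fs {
    struct inode inodes[NINODES];
    int allocated[NBLOCKS]; /* allocation map: 1 = block marked in use */
};

/* Rebuild the allocation map from the inodes and repair any mismatch.
 * Returns the number of map entries that had to be corrected. */
int toy_fsck(struct toy_fs *fs)
{
    int referenced[NBLOCKS] = {0};
    int fixes = 0;

    /* Pass 1: record every block that some in-use inode points to. */
    for (int i = 0; i < NINODES; i++) {
        if (!fs->inodes[i].in_use)
            continue;
        for (int j = 0; j < NDIRECT; j++) {
            int b = fs->inodes[i].blocks[j];
            if (b >= 0 && b < NBLOCKS)
                referenced[b] = 1;
        }
    }

    /* Pass 2: the map must agree with what the inodes reference. */
    for (int b = 0; b < NBLOCKS; b++) {
        if (fs->allocated[b] != referenced[b]) {
            fs->allocated[b] = referenced[b];
            fixes++;
        }
    }
    return fixes;
}
```

A block that is marked allocated but referenced by no inode is leaked space; a block that is referenced but marked free could be handed out twice. In this toy model both cases are repaired by trusting the inodes over the map.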
There are various strategies we might use in our filesystem to try to keep data on disk consistent or make recovery easier.
Pray. ext2 on Linux takes this approach by default. Don't do anything special; hope that data is not too corrupted after a crash, and use a filesystem checker to try to fix things up.
Synchronous metadata writes. Used by FFS, or ext2 with the appropriate flags. Anytime we write metadata to disk, make the write synchronous—that is, don't put it in the cache to be written later, and instead write it out to disk right then, and wait for the write to finish. This actually can't eliminate all corruption, but can make the window during which a crash causes problems short. It also kills performance.
Soft updates. Found in FreeBSD. This is a clever approach to the filesystem consistency problem, and works by carefully ordering when writes are sent to disk, so that the data structures on disk are always consistent. In doing so, it can avoid the performance penalty of synchronous metadata writes. (It can actually still leave a few problems—it might cause disk space to be lost after a crash—but this is easy to fix up any time after a reboot.)
Journaling filesystems. Include many modern filesystems, such as ext3, Reiserfs, and NTFS. All operations are written to a journal (or log) on disk before they are performed. The operations are not actually started until the data is safely in the log. If a crash occurs, the log contains enough information to finish whatever operations were in progress, so that the filesystem is made consistent again. However, now all operations require writes to two places (the log and the actual data structures).
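The journaling strategy can be sketched as a small write-ahead log. This is a toy model, not the design of ext3, Reiserfs, or NTFS: the structures and the names journal_log, journal_commit, and journal_apply are invented for illustration, and the use of an explicit commit flag (with uncommitted transactions discarded at recovery) is an assumption about one common way to organize a journal.

```c
#include <string.h>

#define NBLOCKS    8    /* blocks on the toy disk (assumed) */
#define BLOCK_SIZE 16   /* bytes per block (assumed) */
#define LOG_SLOTS  4    /* capacity of the journal area (assumed) */

struct log_record {
    int  blockno;               /* home location of the block */
    char data[BLOCK_SIZE];      /* new contents for that block */
};

struct toy_disk {
    char blocks[NBLOCKS][BLOCK_SIZE];   /* the filesystem's home locations */
    struct log_record log[LOG_SLOTS];   /* the journal area */
    int  nlogged;                       /* records in the current transaction */
    int  committed;                     /* 1 once the whole transaction is logged */
};

/* Step 1: record a block update in the journal; the home location is
 * left untouched. Returns 0 on success, -1 if the record is rejected. */
int journal_log(struct toy_disk *d, int blockno, const char *data)
{
    if (d->committed || d->nlogged == LOG_SLOTS ||
        blockno < 0 || blockno >= NBLOCKS)
        return -1;
    struct log_record *r = &d->log[d->nlogged++];
    r->blockno = blockno;
    strncpy(r->data, data, BLOCK_SIZE - 1);
    r->data[BLOCK_SIZE - 1] = '\0';
    return 0;
}

/* Step 2: mark the transaction as completely logged. */
void journal_commit(struct toy_disk *d)
{
    d->committed = 1;
}

/* Step 3, also run at recovery after a crash: copy committed records to
 * their home locations, then clear the log. An uncommitted transaction
 * is discarded. Returns the number of blocks written home. */
int journal_apply(struct toy_disk *d)
{
    int applied = 0;
    if (d->committed) {
        for (int i = 0; i < d->nlogged; i++) {
            memcpy(d->blocks[d->log[i].blockno], d->log[i].data, BLOCK_SIZE);
            applied++;
        }
    }
    d->nlogged = 0;
    d->committed = 0;
    return applied;
}
```

Because journal_apply clears the log only after all copies are done, and copying a record twice gives the same result as copying it once, a crash in the middle of step 3 can be handled by simply running journal_apply again. The sketch also makes the cost visible: each updated block is written twice, once to the log and once to its home location.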