CSE 124 : Winter 2015 : Assignment 1 : Build your own webserver

The goal of this project is to build a functional HTTP/1.0 server. This assignment will teach you the basics of distributed programming, client/server structures, and issues in building high performance servers. While the course lectures will focus on the concepts that enable network communication, it is also important to understand the structure of systems that make use of the global Internet.

This project should be done in teams of two.

Due Date: February 3, 2015, before start of class

At a high level, a web server listens for connections on a socket (bound to a specific port on a host machine). Clients connect to this socket and use a simple text-based protocol to retrieve files from the server. For example, you might try the following command from a UNIX machine:


$ telnet www.cs.ucsd.edu 80 
GET /index.php HTTP/1.0\n 
\n

(Type two carriage returns after the "GET" command.) This will return to you (on the command line) the HTML representing the "front page" of the UCSD computer science web site.

One of the key things to keep in mind in building your web server is that the server is translating relative filenames (such as index.html) to absolute filenames in a local filesystem. For example, you might decide to keep all the files for your server in ~student/cse124/server/files/, which we call the document root. When your server gets a request for index.html, it will prepend the document root to the specified file and determine whether the file exists and whether the proper permissions are set on the file (typically the file has to be world readable). If the file does not exist, a file not found error is returned. If a file is present but the proper permissions are not set, a permission denied error is returned. Otherwise, an HTTP OK message is returned along with the contents of the file.

You should also note that web servers typically translate "GET /" to "GET /index.html". That is, index.html is assumed to be the filename if no explicit filename is present. That is why the two URLs "http://www.cs.ucsd.edu" and "http://www.cs.ucsd.edu/index.html" return equivalent results.
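The translation and checks described above can be sketched in C as follows. This is a minimal sketch, not a required design: the function names build_fs_path and status_for_path are hypothetical, only URLs ending in "/" get index.html appended, and any stat() failure is treated as "not found".

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Build "<doc_root><url_path>", appending index.html when the URL ends in
 * '/'. Returns 0 on success, -1 if the result does not fit in out.
 * (Hypothetical helper for illustration.) */
int build_fs_path(const char *doc_root, const char *url_path,
                  char *out, size_t out_len)
{
    size_t n = strlen(url_path);
    const char *suffix = (n > 0 && url_path[n - 1] == '/') ? "index.html" : "";
    int written = snprintf(out, out_len, "%s%s%s", doc_root, url_path, suffix);
    return (written < 0 || (size_t)written >= out_len) ? -1 : 0;
}

/* Map a filesystem path to the status code the server would send:
 * 404 if there is no such regular file, 403 if it is not world readable,
 * and 200 otherwise. */
int status_for_path(const char *fs_path)
{
    struct stat st;
    if (stat(fs_path, &st) != 0 || !S_ISREG(st.st_mode))
        return 404;
    if (!(st.st_mode & S_IROTH))
        return 403;
    return 200;
}
```

A request handler might call build_fs_path() first and then use status_for_path() to decide between sending the file and sending an error response.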

When you type a URL into a web browser, it will retrieve the contents of the file. If the file is of type text/html, it will parse the HTML for embedded links (such as images) and then make separate connections to the web server to retrieve the embedded files. If a web page contains four images, a total of five separate connections will be made to the web server to retrieve the HTML and the four image files.

For this assignment, you will need to support enough of the HTTP protocol to allow an existing web browser (Firefox, Chrome, or Safari) to connect to your web server and retrieve the contents of a web page with HTML code, as well as with in-line images, including jpg and png formats. One example you can use is the UCSD CS front page. Of course, this will require that you copy the appropriate files to your server's document directory. You will not need to have any support for the php aspects of the page.

At a high level, your web server will be structured something like the following:


Forever loop:
    Listen for connections
    Accept new connection from incoming client
    Parse HTTP/1.0 request
    Ensure well-formed request (return error otherwise)
    Determine if target file exists and if permissions are set properly (return error otherwise)
    Transmit contents of file to client (by performing reads on the file and writes on the socket)
    Close the connection
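To make the "parse" and "ensure well-formed" steps concrete, here is a minimal sketch of parsing the first line of a request. The struct and function names are hypothetical, the sketch accepts only GET, and it reports every problem as a 400, which is a simplification.

```c
#include <stdio.h>
#include <string.h>

struct request_line {
    char method[8];
    char path[1024];
    char version[16];
};

/* Parse the first line of a request, e.g. "GET /index.html HTTP/1.0\r\n".
 * Returns 200 if the line is well formed and 400 otherwise. */
int parse_request_line(const char *line, struct request_line *req)
{
    char extra;
    /* Field widths keep sscanf from overflowing the fixed-size buffers. */
    int fields = sscanf(line, "%7s %1023s %15s %c",
                        req->method, req->path, req->version, &extra);
    if (fields != 3)
        return 400;                 /* too few or too many tokens */
    if (strcmp(req->method, "GET") != 0)
        return 400;                 /* this sketch only accepts GET */
    if (req->path[0] != '/')
        return 400;                 /* path must start with '/' */
    if (strcmp(req->version, "HTTP/1.0") != 0 &&
        strcmp(req->version, "HTTP/1.1") != 0)
        return 400;
    return 200;
}
```

For example, parse_request_line("GET /index.html HTTP/1.0\r\n", &req) returns 200 and leaves "/index.html" in req.path, while a line with a missing version returns 400.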

You will have three main choices in how you structure your web server in the context of the above simple structure:

  1. A multi-threaded approach will spawn a new thread for each incoming connection. That is, once the server accepts a connection, it will spawn a thread to parse the request, transmit the file, etc.
  2. A multi-process approach will fork() a new process to handle an incoming web request. This approach is largely appropriate because of its portability (relative to assuming the presence of a given threads package across multiple hardware/software platforms). It does face increased context-switch overhead relative to a multi-threaded approach.
  3. An event-driven architecture will keep a list of active connections and loop over them, performing a little bit of work on behalf of each connection. For example, there might be a loop that first checks to see if any new connections are pending to the server (performing appropriate bookkeeping if so), and then it will loop over all existing client connections and send a "block" of file data to each (e.g., 4096 bytes, or 8192 bytes, matching the granularity of disk block size). This event-driven architecture has the primary advantage of avoiding any synchronization issues associated with a multi-threaded model (though synchronization effects should be limited in your simple web server) and avoids the performance overhead of context switching among a number of threads. Generally speaking, event-driven implementations will entail significantly more effort than multi-threaded or multi-process designs, so be aware of this before choosing an event-driven design.

You may choose from C or C++ to build your web server but you must do it in a Unix-like environment. You will want to become familiar with the interactions of the following system calls to build your system: socket(), select(), listen(), accept(), connect(). We outline a number of resources below with additional information on these system calls. A good book is also available on this topic.

Note that the previous discussion assumes the HTTP/1.0 protocol. Next, add simple HTTP/1.1 support to your web server, consisting of persistent connections and pipelining of client requests to your web server. You will also need to add some heuristic to your web server to determine when it will close a "persistent" connection. That is, after the results of a single request are returned (e.g., index.html), the server should by default leave the connection open for some period of time, allowing the client to reuse that connection to make subsequent requests. This timeout needs to be configured in the server.
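One possible way to implement the idle timeout is to wait on the connection with select() before reading the next request. The sketch below assumes a fixed timeout in whole seconds; the function name wait_for_request is hypothetical, and other heuristics (for example, shortening the timeout under load) are equally valid.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait up to timeout_sec seconds for data on fd.
 * Returns 1 if fd is readable (new request bytes, or the peer closed),
 * 0 if the timeout expired, and -1 on error. */
int wait_for_request(int fd, int timeout_sec)
{
    fd_set readable;
    struct timeval tv;

    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    int ready = select(fd + 1, &readable, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    return ready > 0 ? 1 : 0;
}
```

A connection handler could loop while wait_for_request(fd, timeout) returns 1, serving one request per iteration, and close the connection as soon as it returns 0 or -1.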

Project 1 Submission guidelines

You will be submitting a compressed tar archive named webserver.tar.gz containing your source code and a Makefile. Please do not submit any binary files. Instructions for how to submit your code will be posted on Piazza soon.

Your Makefile should build a binary named httpd, which should take two command line arguments: a port (an integer) and a path to the document root (a string). Please verify that the following commands build and start your web server:


tar -xzvf webserver.tar.gz
make
./httpd <port> <path/to/document/root>
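The two command line arguments could be validated along the following lines. This is only a sketch: the function name parse_args is hypothetical, and the accepted port range of 1 to 65535 is an assumption rather than a stated requirement.

```c
#include <errno.h>
#include <stdlib.h>

/* Validate "./httpd <port> <path/to/document/root>".
 * Returns 0 and fills *port and *doc_root on success, -1 on bad usage. */
int parse_args(int argc, char **argv, int *port, const char **doc_root)
{
    if (argc != 3)
        return -1;

    char *end;
    errno = 0;
    long value = strtol(argv[1], &end, 10);
    /* Reject empty input, trailing junk, overflow, and out-of-range ports. */
    if (errno != 0 || end == argv[1] || *end != '\0' ||
        value < 1 || value > 65535)
        return -1;

    *port = (int)value;
    *doc_root = argv[2];
    return 0;
}
```

main() can call parse_args(argc, argv, &port, &doc_root) and print a usage message before exiting if it returns -1.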

Extra credit opportunity

Provide simple server support for ".htaccess" files on a per-directory basis to limit the domains that are allowed access to a given directory. You only need to implement the "allow/deny from 000.000.000.000/xx" syntax, and rules should be applied in descending order (that is, from the top of the file down). You should be able to allow/deny from both IP addresses and domain names. .htaccess rules do not have to be applied recursively. An example of a .htaccess file:


deny from 172.22.16.18/32
allow from 172.22.16.0/24
allow from 192.168.0.0/16
allow from mymachine.ucsd.edu
deny from 0.0.0.0/0

The above file allows access from any host in the 172.22.16.0/24 subnet, except for host 172.22.16.18. Any 'local' addresses in the 192.168 subnet are allowed, as is 'mymachine.ucsd.edu'. Any other hosts are denied (the default rule on the last line).
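The rule evaluation for the example above can be sketched as follows. The sketch assumes that the first matching rule decides, which is consistent with the example, and that a client matching no rule is allowed. It handles only IPv4 address/prefix rules and skips domain-name rules such as 'mymachine.ucsd.edu', which would need a DNS lookup. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if dotted-quad ip falls inside cidr ("a.b.c.d/len"), else 0. */
int cidr_match(const char *ip, const char *cidr)
{
    char net[INET_ADDRSTRLEN];
    const char *slash = strchr(cidr, '/');
    if (slash == NULL || (size_t)(slash - cidr) >= sizeof net)
        return 0;
    memcpy(net, cidr, (size_t)(slash - cidr));
    net[slash - cidr] = '\0';

    char *end;
    long len = strtol(slash + 1, &end, 10);
    struct in_addr a, n;
    if (end == slash + 1 || *end != '\0' || len < 0 || len > 32 ||
        inet_pton(AF_INET, ip, &a) != 1 || inet_pton(AF_INET, net, &n) != 1)
        return 0;

    /* A /0 prefix is handled separately: shifting by 32 bits is undefined. */
    uint32_t mask = (len == 0) ? 0 : 0xFFFFFFFFu << (32 - len);
    return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}

/* Apply rules from top to bottom; the first matching rule decides.
 * Returns 1 to allow and 0 to deny. Clients that match no rule are
 * allowed, which is an assumption of this sketch. */
int htaccess_allows(const char *const rules[], size_t count, const char *ip)
{
    for (size_t i = 0; i < count; i++) {
        char action[8], target[256];
        if (sscanf(rules[i], "%7s from %255s", action, target) != 2)
            continue;               /* skip lines we cannot parse */
        if (cidr_match(ip, target))
            return strcmp(action, "allow") == 0;
    }
    return 1;
}
```

With the five example rules loaded into an array, 172.22.16.18 is denied by the first rule, 172.22.16.5 is allowed by the second, and an unrelated address such as 10.1.2.3 falls through to the final deny rule.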

Grading

  1. Basic Functionality (50%)
    1. Accepts well-formed request for a static, text-based HTML page in the "web root" directory and returns the correct response (i.e., 'text/html') and content (i.e., returned content matches the on-disk file).
    2. As above, but for binary JPEG files. Content-Type should be 'image/jpeg'
    3. As above, but for binary PNG files. Content-Type should be 'image/png'
    4. text/html, image/jpeg, and image/png are the only content types that you must support for this project
    5. Accepts well-formed request for content in a subdirectory (e.g., /images/foo.jpg)
    6. If your server is given a malformed request, it should not crash, but instead return a '400' response code
    7. If your server is given a request for a file that does not exist, it should return a '404' not found error.
    8. Requests for a directory have 'index.html' appended to them (e.g., "http://server/" translates to "http://server/index.html")
  2. Concurrency (20%)
    1. While your server is processing a request from Client 1, it should be able to accept and process a request from Client 2 (and 3, 4, ...)
  3. Security (10%)
    1. A client should not be able to "escape" the document root (e.g., by requesting the URL 'http://server/../../etc/passwd')
    2. If a file is not world-readable, the server should return a '403' error
  4. HTTP/1.1 Support (20%)
    1. Basic HTTP/1.1 functionality (e.g., ability to make pipelined requests)
  5. Extra credit (10%)
    1. Implementing .htaccess functionality as described above.
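Two of the checks in the grading list, choosing the Content-Type and rejecting requests that escape the document root, can be sketched as small helpers. Both function names are hypothetical. The escape check looks only at ".." segments in the URL path, so it does not account for symbolic links or percent-encoded characters.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <strings.h>

/* Return the Content-Type for a path based on its extension, or NULL if
 * the type is not one of the three this project requires. */
const char *content_type_for(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return NULL;
    if (strcasecmp(dot, ".html") == 0)
        return "text/html";
    if (strcasecmp(dot, ".jpg") == 0 || strcasecmp(dot, ".jpeg") == 0)
        return "image/jpeg";
    if (strcasecmp(dot, ".png") == 0)
        return "image/png";
    return NULL;
}

/* Return 1 if url_path would climb above the document root via "..". */
int escapes_doc_root(const char *url_path)
{
    int depth = 0;
    const char *p = url_path;
    while (*p != '\0') {
        while (*p == '/')
            p++;                            /* skip separators */
        size_t len = strcspn(p, "/");       /* length of this segment */
        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (--depth < 0)
                return 1;                   /* went above the root */
        } else if (len > 0 && !(len == 1 && p[0] == '.')) {
            depth++;                        /* ordinary path segment */
        }
        p += len;
    }
    return 0;
}
```

An alternative to the segment-counting check is to resolve the full path with realpath() and verify that the result still begins with the document root, which also covers symbolic links.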

Helpful resources

There are a variety of resources online that can be of help while you're working on this project. Two such resources are:
  1. HTTP Made Really Easy
    1. http://www.jmarshall.com/easy/http/
  2. Eddie Kohler's guide to using Git
    1. http://cs61.seas.harvard.edu/wiki/2012/Git

Revisions

  1. January 9, 2015
    1. Added the time that the project is due
    2. Added the 'helpful resources' section
  2. January 11, 2015
    1. Corrected the explanation of the .htaccess security rule format