CSE 124 Building a web server
2017 October 2: Project 1: Building a web server

FAQ

Make sure to check out the P1 FAQ! It has answers to a lot of the common questions we’ve been asked on Piazza and in office hours.

Overview

In this project, you are going to build a simple web server that implements a subset of the HTTP/1.1 protocol specification (as defined in this document).

Learning objectives

The goal of this project is to build a simple web server that can receive requests from web clients and send back responses. During this project you will:

  • Correctly implement a network protocol from a written specification
  • Master the UNIX sockets API
  • Develop a methodology for testing protocol correctness and performance
  • Use git and GitHub.com for managing source code

Logistics

  • Due date: Friday, Nov 3, 2017 at 5pm.
  • Teams: You can work individually, or in a team of two students.
  • GitHub starter code invitation. Please read this post about how to accept GitHub invitations to team projects, even if you plan on working alone, since the GitHub interface is somewhat confusing. If you want to work individually, you still need to create a “team”; you’ll simply be the only member of that team.

Triton-HTTP/1.1 Specification for this project

In this project, we are going to be implementing a subset of the full HTTP/1.1 specification (which is many hundreds of pages long when you consider all the extensions and supplemental documents!). Because our implementation differs slightly from the official HTTP spec, we’re calling it “TritonHTTP.” Portions of this specification are courtesy of James Marshall, used with permission from the author. If you have any questions about what the spec is supposed to be, please refer to this document; don’t go to the actual HTTP spec, because it is far more complex than TritonHTTP.

Client/server protocol

TritonHTTP is a client-server protocol that is layered on top of a reliable stream-oriented transport protocol (i.e., TCP). Clients issue request messages to the server, and servers reply with response messages. In its most basic form, a single HTTP-level request-reply exchange happens over a single, dedicated TCP connection. The client first connects to the server and sends the HTTP request message; the server replies with an HTTP response and then closes the connection.

The HTTP protocol is stateless, meaning that each response is provided without reference to a client’s previous interactions with that server. There is a mechanism called “Cookies” which enables the server to send some state to the client, that the client then sends back to the server the next time it connects, so that it appears like the server keeps state. However, if you delete that cookie, or use a different web browser to reconnect to the server, all that state is lost. Cookies won’t be a part of this project.

Repeatedly setting up and tearing down TCP connections reduces overall network throughput and efficiency, and so HTTP has a mechanism whereby a client can reuse a TCP connection to a given server. The idea is that the client opens a TCP connection to the server, issues an HTTP request, gets an HTTP reply, and then issues another HTTP request on the already open outbound part of the connection. The server replies with the response, and this can continue through multiple request-reply interactions. The client signals the last request by setting a “Connection: close” header. The server indicates that it will not handle additional requests by setting the “Connection: close” header in the response. Note that the client can issue more than one HTTP request without necessarily waiting for full HTTP replies to be returned.

To support clients that do not properly set the “Connection: close” header, the server must implement a timeout mechanism to know when it should close the connection (otherwise it might just wait forever). For this project, you should set a server timeout of 5 seconds, meaning that if the server doesn’t receive a complete HTTP request from the client after 5 seconds, it closes the connection.
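One way to approximate this timeout is a per-socket receive timeout. The sketch below assumes a POSIX system; `set_recv_timeout` is a hypothetical helper name, not part of the starter code. Note that `SO_RCVTIMEO` bounds each individual `recv()` call rather than the arrival of a whole request, so a client that trickles bytes slowly would need an additional deadline check.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/time.h>

// Hypothetical helper: after this call, recv() on `fd` fails with
// EAGAIN/EWOULDBLOCK if no data arrives within `seconds`.
// Returns 0 on success, -1 on error (same as setsockopt).
int set_recv_timeout(int fd, int seconds) {
    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

A connection handler could call `set_recv_timeout(client_fd, 5)` once after `accept()` and treat a timed-out `recv()` as the signal to close the connection.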

HTTP messages

HTTP request and response messages are in plain-text format, consisting of a header section and an optional body section. The header section is separated from the body section with a blank line. The header consists of an initial line (which is different between requests and responses), followed by zero or more key-value pairs. Every line is terminated by a CRLF (carriage-return followed by a line feed). Thus a message looks like:

<initial line, differs between requests and responses>[CRLF]
Key1: Value1[CRLF]
Key2: Value2[CRLF]
Key3: Value3[CRLF]
[CRLF]
<optional body...>

Messages without a body section still have the trailing CRLF (a blank line) present so that the server knows that it should not expect additional headers. You can assume that HTTP requests are not larger than 8 KB (2^13 bytes). HTTP responses can be much larger, in some cases tens of gigabytes.

Request Initial Line

The initial line of an HTTP request header has three components:

  • The method (in this project that will be GET)
  • The URL
  • The highest HTTP version that the client supports

The method field indicates what kind of request the client is issuing. The most common is a GET request, which indicates that the client wants to download the content indicated by the URL (described next).

The URL is a pointer to the resource that the client is interested in. Examples include /images/myimg.jpg and /classes/fall/cs101/index.html.

The version field takes the form HTTP/x.y, where x.y is the highest version that the client supports. For this course we’ll always use 1.1, so this value should be HTTP/1.1.

The fully formed initial request line would thus look something like:

GET /images/myimg.jpg HTTP/1.1
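The three fields of the request initial line can be pulled apart as sketched below. The names `RequestLine` and `parse_request_line` are hypothetical, and treating any method other than GET or any version other than HTTP/1.1 as malformed is an assumption based on this project's scope.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Illustrative container for the three fields of the request initial line.
struct RequestLine {
    std::string method;
    std::string url;
    std::string version;
};

// Parses a line such as "GET /images/myimg.jpg HTTP/1.1" (CRLF already removed).
// Returns false unless the line has exactly three whitespace-separated fields,
// the method is GET, the URL starts with '/', and the version is HTTP/1.1.
bool parse_request_line(const std::string& line, RequestLine& out) {
    std::istringstream in(line);
    std::string extra;
    if (!(in >> out.method >> out.url >> out.version)) return false;
    if (in >> extra) return false;  // unexpected fourth field
    if (out.method != "GET") return false;
    if (out.version != "HTTP/1.1") return false;
    if (out.url.empty() || out.url[0] != '/') return false;
    return true;
}
```

A caller would typically answer with a 400 response when this function returns false.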

Response Initial Line

The initial line of an HTTP response also has three components, which are slightly different from those in the request line:

  • The highest HTTP version that the server supports
  • A three-digit numeric code indicating the status of the request (e.g., whether it succeeded or failed, including more fine-grained information about how to interpret this response)
  • A human-friendly text description of the return code

HTTP Response code semantics

The first digit of the response code indicates the type of response. These types include:

  • 1xx is informational
  • 2xx is a success type
  • 3xx means that the content the client is looking for is located somewhere else
  • 4xx means that the client’s request had some kind of error in it
  • 5xx means that the server encountered an error while trying to serve the client’s request

For this project, you’ll need to support:

  • 200 OK: The request was successful
  • 400 Client Error: The client sent a malformed or invalid request that the server doesn’t understand
  • 403 Forbidden: The request was not served because the client wasn’t allowed to access the requested content
  • 404 Not Found: The requested content wasn’t there
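As an illustration, the response initial line for these four codes could be assembled as follows. The function names are hypothetical, and the "Unknown" fallback is an assumption for codes this document does not list.

```cpp
#include <cassert>
#include <string>

// Text descriptions for the four status codes this project requires.
// "Client Error" for 400 follows this document's wording.
std::string reason_phrase(int code) {
    switch (code) {
        case 200: return "OK";
        case 400: return "Client Error";
        case 403: return "Forbidden";
        case 404: return "Not Found";
        default:  return "Unknown";  // assumption: not specified by this document
    }
}

// Builds the response initial line, e.g. "HTTP/1.1 404 Not Found\r\n".
std::string status_line(int code) {
    return "HTTP/1.1 " + std::to_string(code) + " " + reason_phrase(code) + "\r\n";
}
```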

HTTP header key-value pairs

After the initial line, an HTTP message can contain zero or more key-value pairs that add information about the request or response (called “HTTP headers”). Some of the keys are specific to request messages, some are specific to response messages, and some can be used with both requests and responses. The format of these key-value pairs is inspired by RFC 822 (though again, don’t worry about the 822 spec; just go by what is here).
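A single key-value line can be split at its first colon, as in the sketch below. The name `parse_header_line` is hypothetical, and skipping spaces or tabs after the colon is an assumption about how lenient the parser should be.

```cpp
#include <cassert>
#include <string>

// Splits "Key: Value" (CRLF already removed) into key and value.
// Returns false when the line has no colon or an empty key; the caller
// would treat that as a malformed request.
bool parse_header_line(const std::string& line, std::string& key, std::string& value) {
    std::string::size_type colon = line.find(':');
    if (colon == std::string::npos || colon == 0) return false;
    key = line.substr(0, colon);
    // Skip optional spaces or tabs between the colon and the value.
    std::string::size_type start = line.find_first_not_of(" \t", colon + 1);
    value = (start == std::string::npos) ? "" : line.substr(start);
    return true;
}
```

Splitting at the first colon matters for values that contain colons themselves, such as `Host: localhost:8080`.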

For this assignment, you must implement or support the following HTTP headers:

  • Request headers:

    • Host (required, 400 client error if not present)
    • Connection (optional, if set to “close” then server should close connection with the client after sending response for this request)
    • You should gracefully handle any other valid request headers that the client sends. Any request header not in the proper form (e.g., missing a colon) should result in a 400 error.
  • Response headers:

    • Server (required)
    • Last-Modified (required only if return type is 200)
    • Content-Type (required if return type is 200; otherwise if you create a custom error page, you can set this to ‘text/html’)
    • Content-Length (required if return type is 200; otherwise if you create a custom error page, you can set this to the length of that response)

The format for the last-modified header is Last-Modified: <day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT.
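A header in this format can be produced with `strftime()`. The sketch below assumes abbreviated day and month names and a zero-padded day, as in standard HTTP dates, and the default "C" locale; `last_modified_header` is a hypothetical name.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Formats a modification time (for example st_mtime from stat()) as a
// complete Last-Modified header line, including the trailing CRLF.
std::string last_modified_header(std::time_t mtime) {
    struct tm gmt;
    gmtime_r(&mtime, &gmt);  // POSIX, thread-safe variant of gmtime()
    char buf[64];
    std::strftime(buf, sizeof(buf), "%a, %d %b %Y %H:%M:%S GMT", &gmt);
    return std::string("Last-Modified: ") + buf + "\r\n";
}
```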

The Content-Type for .jpg files should be “image/jpeg”, for .png files it should be “image/png”, and for HTML files it should be “text/html”.
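This mapping can be expressed as a small lookup on the file extension. In the sketch below, `content_type_for` is a hypothetical name, and both the `.html` extension check and the fallback type for unknown extensions are assumptions, since this document only names the three types above.

```cpp
#include <cassert>
#include <string>

// Maps a file path to a Content-Type value based on its extension.
std::string content_type_for(const std::string& path) {
    std::string::size_type dot = path.rfind('.');
    std::string ext = (dot == std::string::npos) ? "" : path.substr(dot);
    if (ext == ".jpg") return "image/jpeg";
    if (ext == ".png") return "image/png";
    if (ext == ".html") return "text/html";
    return "application/octet-stream";  // assumption: fallback not specified here
}
```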

A custom error page is simply a human-friendly message (formatted as HTML) explaining what went wrong in the case of an error; custom error pages are optional, however if you use one, the Content-Type and Content-Length headers have to be set correctly.

Project details

Basic web server functionality

At a high level, a web server listens for connections on a socket (bound to a specific address and port on a host machine). Clients connect to this socket and use the above-specified HTTP protocol to retrieve files from the server. For this project, your server will need to be able to serve out HTML files as well as images in jpg and png formats. You do not need to support server-side dynamic pages, Node.js, server-side CGI, etc.

Mapping relative URLs to absolute file paths

Clients make requests for files using a Uniform Resource Locator (URL), such as /images/crypto/enigma.jpg. One of the key things to keep in mind in building your web server is that the server must translate that relative URL into an absolute filename on the local filesystem. For example, you might decide to keep all the files for your server in ~aturing/cse124/server/www-files/, which we call the document root. When your server gets a request for the above-mentioned enigma.jpg file, it will prepend the document root to the specified file to get an absolute file name of ~aturing/cse124/server/www-files/images/crypto/enigma.jpg.

You need to ensure that malformed or malicious URLs cannot “escape” your document root to access other files. For example, if a client submits the URL /images/../../../../.ssh/id_dsa, they should not be able to download the ~aturing/.ssh/id_dsa file. If a client uses one or more .. directories in such a way that the server would “escape” the document root, you should return a 404 Not Found error back to the client. Take a look at the realpath() library function for help in dealing with document roots.
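A minimal sketch of this check, assuming POSIX `realpath()`: canonicalize both the document root and the requested path, then verify that the result still lies under the root. The name `resolve_path` is hypothetical. Because `realpath()` fails for paths that do not exist, a missing file and an escaping URL both end up as a `false` return here, which matches the 404 the caller would send in either case.

```cpp
#include <cassert>
#include <limits.h>
#include <stdlib.h>
#include <string>

// Resolves `url` against `doc_root` and stores the canonical absolute path in `out`.
// Returns false if the path does not exist or resolves outside the document root.
bool resolve_path(const std::string& doc_root, const std::string& url, std::string& out) {
    char root_buf[PATH_MAX];
    char path_buf[PATH_MAX];
    // Canonicalize the root too: it may end in '/' or contain symlinks.
    if (realpath(doc_root.c_str(), root_buf) == NULL) return false;
    std::string root(root_buf);
    std::string candidate = root + "/" + url;
    if (realpath(candidate.c_str(), path_buf) == NULL) return false;
    std::string resolved(path_buf);
    // Compare against the root plus '/', so that a sibling directory such as
    // "/www-files-old" is not mistaken for something inside "/www-files".
    std::string prefix = root;
    if (prefix[prefix.size() - 1] != '/') prefix += '/';
    if (resolved != root && resolved.compare(0, prefix.size(), prefix) != 0) return false;
    out = resolved;
    return true;
}
```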

Filesystem permissions

After your server maps the client’s request into a file in the document root, you must check whether that file actually exists and whether the proper permissions are set on the file (the file has to be “world” readable). If the file does not exist, a file not found error (error code 404) is returned. If a file is present but the proper permissions are not set, a permission denied error is returned (error code 403). When a 403 error is returned, no information about the real file should be returned in the headers or body section of the reply (e.g., the file size). Otherwise, a 200 OK message is returned along with the contents of the file.
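The existence and permission checks can be sketched with `stat()` and the `S_IROTH` ("others can read") bit. The function names below are hypothetical, and answering 404 for anything that is not a regular file is an assumption, since this document does not say how to treat a directory that reaches this step.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

// Decides the status code from a file's mode bits alone.
int status_for_mode(mode_t mode) {
    if (!S_ISREG(mode)) return 404;    // assumption: only regular files are served
    if (!(mode & S_IROTH)) return 403; // not world-readable
    return 200;
}

// Returns 404 if the file cannot be stat()ed, otherwise defers to the mode check.
int status_for_file(const std::string& path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return 404;
    return status_for_mode(st.st_mode);
}
```

Keeping the mode check separate from the `stat()` call makes the permission logic easy to test without creating files.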

You should also note that web servers translate GET / to GET /index.html. That is, index.html is assumed to be the filename if no explicit filename is present. That is why the two URLs http://www.cs.ucsd.edu and http://www.cs.ucsd.edu/index.html return equivalent results. You will need to support this mapping in your server.

IP address-based allow/deny rules and enforcement

Some web servers add support for protecting content based on the IP address of the client, such that certain documents are only accessible to clients coming from certain IP addresses. For example, a company might have an “internal” portion of their website that is only accessible to employees who are physically located at the company or logged into the company network via a VPN (virtual private network, a topic we’ll go over later in the term).

To define the access rules based on IP addresses, many web servers, including the Apache server, read from a “.htaccess” file (note the dot in front of the filename). If there is a .htaccess file in a given directory, then the rules in that file apply to that directory. In general, production servers apply the rules from .htaccess files to subdirectories as well, and there are complex rules about how rules in different directories interact with each other and how they are merged together. For this project, we’re going to simplify that whole system and only apply rules in a .htaccess file to that directory and none other.

The format of a .htaccess file is a list of lines, each starting with “allow from” or “deny from”. After the “from” is an IP address range in CIDR format (e.g., xxx.yyy.zzz.www/pp). For more information on CIDR addressing, consult Peterson and Davie, section 3.2.5. Examples include “allow from 172.22.16.12/24”, “deny from 121.229.0.0/16”, and “deny from 0.0.0.0/0”. Note that 0.0.0.0/0 is a special address that just means “all IP addresses” (wildcard address). You do not need to handle fully qualified host and domain names, just raw IP addresses in CIDR format. Rules should be applied in the order they appear in the file, from top to bottom, and the first rule that matches the client’s address determines the outcome.

For example:

deny from 172.22.16.18/32
allow from 172.22.16.0/24
allow from 192.168.0.0/16
deny from 0.0.0.0/0

allows any host in the 172.22.16.0/24 subnet, except for host 172.22.16.18, to access this page. Hosts in the 192.168.0.0/16 subnet can also access the content. Any other hosts are denied by the default rule on the last line.

When a host is denied, it should receive a 403 error message and the content should not be returned, nor should any metadata about the real file (e.g., its file size).
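The rule matching described above can be sketched as follows. The names `Rule`, `parse_rule`, and `is_allowed` are hypothetical, and allowing a client when no rule matches is an assumption. Host bits in a rule's address (as in 172.22.16.12/24) are masked off before comparison.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One parsed line of a .htaccess file.
struct Rule {
    bool allow;        // true for "allow from", false for "deny from"
    uint32_t network;  // network address in host byte order, host bits cleared
    uint32_t mask;     // netmask in host byte order
};

// Parses a line such as "deny from 172.22.16.18/32". Returns false on malformed input.
bool parse_rule(const std::string& line, Rule& rule) {
    std::istringstream in(line);
    std::string action, from, cidr;
    if (!(in >> action >> from >> cidr) || from != "from") return false;
    if (action != "allow" && action != "deny") return false;
    std::string::size_type slash = cidr.find('/');
    if (slash == std::string::npos) return false;
    struct in_addr addr;
    if (inet_pton(AF_INET, cidr.substr(0, slash).c_str(), &addr) != 1) return false;
    int bits = -1;
    std::istringstream prefix(cidr.substr(slash + 1));
    if (!(prefix >> bits) || bits < 0 || bits > 32) return false;
    rule.allow = (action == "allow");
    // Shifting a 32-bit value by 32 is undefined, so /0 is handled separately.
    rule.mask = (bits == 0) ? 0 : (0xFFFFFFFFu << (32 - bits));
    rule.network = ntohl(addr.s_addr) & rule.mask;
    return true;
}

// Applies rules top to bottom; the first rule whose range contains the client decides.
bool is_allowed(const std::vector<Rule>& rules, const std::string& client_ip) {
    struct in_addr addr;
    if (inet_pton(AF_INET, client_ip.c_str(), &addr) != 1) return false;
    uint32_t ip = ntohl(addr.s_addr);
    for (std::vector<Rule>::size_type i = 0; i < rules.size(); ++i) {
        if ((ip & rules[i].mask) == rules[i].network) return rules[i].allow;
    }
    return true;  // assumption: no matching rule means access is allowed
}
```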

Program structure

At a high level, your program will be structured as follows.

Initialize

We will provide you with starter code that handles command-line arguments and calls into your C/C++ code with a port and the document root. Note that the document root and port number will be parameters that are passed into your program; do not hard code file paths or ports, as we will be testing your code against our own document root. Also, do not assume that the files to serve out are in the same directory as the web server. We will call your program with an absolute path to the document root that may or may not end in a final forward slash: e.g., “/var/home/htdocs” and/or “/var/home/htdocs/”.

Setup server socket and threading

Create a TCP server socket, and arrange for a thread to be spawned (or retrieved from a thread pool) whenever a new connection comes in.
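A minimal thread-per-connection sketch using the sockets API and C++11 threads is shown below. All function names are hypothetical, `handle_client` is only a placeholder, and a thread pool would be an equally valid design.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <thread>
#include <unistd.h>

// Creates a TCP socket listening on `port` on all interfaces.
// Returns the file descriptor, or -1 on error.
int make_listen_socket(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));  // allow quick restarts
    struct sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Placeholder for the per-connection logic (framing, parsing, responding).
void handle_client(int client_fd) {
    close(client_fd);
}

// Accepts connections forever, handing each one to its own detached thread.
void serve_forever(int listen_fd) {
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0) continue;  // transient error: keep accepting
        std::thread(handle_client, client_fd).detach();
    }
}
```

The entry point would then call `make_listen_socket(port)` followed by `serve_forever(fd)`. Code using `std::thread` generally needs to be built with `-pthread`.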

Separating framing from parsing

As we will discuss in class, two key operations that must be performed to build your web server are 1) separating out application-level messages by determining when one message starts and another ends (framing), and 2) processing individual messages to understand their meaning (parsing). In your project, you must separate these steps as follows.

You will write code that reads from the client socket and produces HTTPMessage structs (in C) or objects (in C++). You will then have parsing code that turns an HTTPMessage into an HTTPRequest struct or object.

On the response path (from the server back to the client), the reverse will happen: your code will initialize and fill in an HTTPResponse struct/object, and then your framing code will convert that into an HTTPMessage struct/object, which you will then send over the socket back to the client. Your code must separate parsing and framing into separate steps to receive credit.
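On the request path, the framing step can be sketched as a buffer that accumulates whatever `recv()` returns and releases one complete message at a time. The class name `Framer` is hypothetical, a plain string stands in for the HTTPMessage type, and the sketch assumes requests carry no body, so a message ends at the first blank line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Accumulates raw bytes from the socket and hands back one complete
// header block at a time, independent of how recv() split the data.
class Framer {
public:
    // Appends bytes exactly as they arrived from the socket.
    void append(const char* data, std::size_t len) { buffer_.append(data, len); }

    // True if the buffer holds at least one complete message.
    bool has_message() const { return buffer_.find("\r\n\r\n") != std::string::npos; }

    // Removes and returns the first complete message, without the blank
    // line that terminated it. Call only when has_message() is true.
    std::string pop_message() {
        std::string::size_type end = buffer_.find("\r\n\r\n");
        std::string message = buffer_.substr(0, end + 2);  // keep the last line's CRLF
        buffer_.erase(0, end + 4);
        return message;
    }

private:
    std::string buffer_;
};
```

The parsing step would then operate on each returned message without ever touching the socket, which keeps the two stages separate.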

Implementation

You may choose from C or C++ to build your web server but you must do it in the ieng6 Unix environment with the sockets API we’ve been using in class (e.g., no HTTP libraries). We should be able to build your code by simply typing “make”. Make sure that your code builds from a fresh clone of your repository. It should be possible for us to perform the following commands to run your server:

$ git clone git@github.com/...
$ cd <your_repo_directory>
$ make
$ ./httpd [port] [doc_root]

You can implement the threading code with pthread calls directly, or use C++11-style threads. As of the time of writing this specification, the lab machines don’t support C++ versions newer than C++11.

Grading

The points on this project are assigned as follows:

  • 94% is based on the correctness of your web server code
  • 6% is based on the quality of writing of your code

Make sure your code works on the ieng6.ucsd.edu servers.

This project can be done individually, or in a team of two students.

Correctness (quantitative)

Basic functionality for 200 status code responses (50 pts):

  • This category represents error-free, valid requests that result in a 200 status code. Your server should correctly handle valid GET requests for HTML, JPEG, and PNG files.
    • The response headers should be set correctly
    • The response body should match the content
  • You should support directories and subdirectories
  • “http://server:port/” should be mapped to “http://server:port/index.html”

Pipelining (15 pts):

  • Your server should support pipelining of requests and responses
  • If a client includes a “Connection: close” request header, then the server should give its response then close the connection, and not handle any additional requests
  • Your server should implement a 5 second timeout, so that if a client has not included a “Connection: close” header in their most recent request, but hasn’t issued a new request in over 5 seconds, then the server should close the connection.

Basic functionality for non-200 status code responses (10 pts):

  • Handles 403 for files that aren’t ‘world readable’
  • Handles 404 for files that aren’t found
  • Handles 404 for URLs that escape the doc root
  • Correctly handles malformed HTTP requests by issuing a 400 error

Security (15 pts):

  • Correctly handles .htaccess allow/deny rules

Concurrency (10 pts):

  • Your server should be able to handle concurrent clients using threads

Autograder

On ieng6, in the class’s public directory, we’ve provided an autograding program that we’ll use to evaluate your project. We have populated it with a few of the basic test cases so that you can ensure that you’re on the right track. There are many additional test cases we have not provided to you that we’ll use in testing the final project.

In the directory ~cs124f:

.
|-- public
|   `-- project1
|       |-- bin
|       |   `-- cse124HttpdTester
|       `-- htdocs
|           |-- LICENSE
|           |-- index.html
|           |-- kitten1.jpg
|           `-- subdir1
|               `-- wolves.jpg

bin/cse124HttpdTester is the autograder. Run it against your server with the provided htdocs folder as the document root.

Quality (qualitative)

We will evaluate code quality using the following rubric.

Readability

  • Does not meet expectations (0 pts): Code lacks structure; a reader must spend considerable time and effort to understand where functionality is and how parts of the code relate to each other.
  • Meets expectations weakly (1 pt): Code has structure, variables and functions are largely descriptive, and the layout allows a reader to gain an understanding of where functionality is and how different portions of the code relate to each other after some effort.
  • Meets expectations strongly (2 pts): Code is well structured and indented, variables and functions are descriptive, and the layout is highly conducive to quickly understanding where functionality is and how different portions of the code relate to each other.
  • Outstanding (2 pts): Code serves as a reference example for others.

Modularity

  • Does not meet expectations (0 pts): The codebase lacks modularity. The implementations of different tasks and functions are intermixed.
  • Meets expectations weakly (1 pt): Code is divided into well-formed, separate modules.
  • Meets expectations strongly (2 pts): Code is divided into modules that can be developed independently of each other. A reader can gain insight into the design and function of the program through its structure.
  • Outstanding (2 pts): Code is divided into loosely coupled modules that can evolve separately, can be independently tested and reasoned about, and can be developed independently of each other. These modules also serve as implicit documentation on how the program works and shed light on the overall design of the system.

Efficiency

  • Does not meet expectations (0 pts): Algorithms, data structures, and code structure are designed and implemented in a way that uses up excessive resources and is subject to severe performance penalties. For this project, issuing socket calls one byte at a time will be considered in this category.
  • Meets expectations weakly (1 pt): Algorithms, data structures, and code structure are designed and implemented to meet specifications in a way that does not needlessly or exceptionally use up resources or hinder performance.
  • Meets expectations strongly (2 pts): Algorithms, data structures, and code structure are designed and implemented to meet specifications in a way that largely balances readability, modularity, testability, evolvability, and performance.
  • Outstanding (2 pts): Algorithms, data structures, and code structure are designed and implemented to meet all specifications in a way that ideally balances readability, modularity, testability, evolvability, and performance.

Hints and reminders

  • Make sure to separate framing and parsing in your code to receive credit

  • To handle “..” in the filepath, take a look at the ‘realpath’ function (you can type “man realpath” to learn more)

  • The recv() system call may not return a full HTTP request, meaning that multiple recv() calls may be needed for the server to read in a complete request. On the other hand, if the client issues two back-to-back HTTP requests, it is possible that the server issues a single recv() call and both requests are returned by that call. For this reason, you must ensure that you frame/unframe HTTP messages based on the protocol, not based on what recv() returns.