CSE 124: Networked Services

Assignment 3: Distributed Search Engine

Introduction

Search has become an indispensible aspect of our lives today. In this assignment, we will focus on the problem of building a scalable web search backend. Assume that you are provided with a large inverted index. The problem is to efficiently answer queries against this inverted index.

For scalability, we want to use a cluster based architecture for the system. A naive implementation would be to replicate the entire index across each node in the cluster. However, this quickly becomes unscalable and inefficient as the size of the index increases.

We therefore want to partition the index across the nodes. At the same time, we need to do some replication to deal with node failures. The search engine must support multi-word queries (so to answer the query, multiple nodes might need to be contacted).

For this assignment, you will be provided with the skeleton code for a fully functional distributed search engine. There is a central "master" and several "index servers". Clients always talk to the master. Each index server registers with the master, and notifies it of the index range it is responsible for. On receipt of a query, the master contacts the index servers, joins the results and returns them back to the client. We use XML-RPC as the transport mechanism (we have used the XMLRPC-C library).

Due date: December 7, 2007

IMPORTANT: If ALL of the project goals are met, you will receive up to 25% extra credit.

Enhancements

Better replication and partitioning

Each index server takes as input a directory name. It expects two files in the directory: range and index. The index file simply contains an inverted index. Each line contains a word followed by the list of documents it occurs in. Fields are tab separated. The range file contains the range of terms being served by this server. Following are examples of valid ranges:


    ^
    $
    
This serves the entire index.

    begin
    end
    
This serves everything from the word begin (inclusive) till the word end (exclusive).

This provides the basic mechanism for partitioning. However, it is up to you to decide exactly how you want to partition your data and how you want to replicate it across different servers. Your write up should include a description of your rationale and the implementation.

Further, the provided code broadcast the search terms to ALL servers. Implement a smarter scheme, where a search term is dispatched to a server only if it lies within the range registered by that server. Take replication in consideration as well: if there are multiple replicas for some part of the index, you might want to send the query only to one of the replicas.

Intelligent retry of failed queries

The provided code very basic failure handling. If a query fails at ANY of the index servers, the whole process is restarted. Implement a more intelligent scheme where only the failed terms are re-queried, and not all of the search terms. Failed terms might need to be redirected to their replicas, if the primary has failed. Handle the case where none of the replicas are alive.

Failure detection

If an index server fails, the master does not detect this until it receives a search query and tries to contact the index server for it. Implement better failure detection by using a heartbeat mechanism: each index server periodically pings the master server. If the master doesn't hear from an index server for some time, it declares it as dead. Further queries for that index server should be directed to a replica. Handle cases where the index server subsequently comes back up.

Ranking of search results

Each index server returns a list of documents the term occurs in, as well as the number of occurances. The current ranking function is by no means optimal: it orders results by the number of occurances of the term. Implement a better ranking function. Some suggestions:

  • Take into account the number of documents a term occurs in
  • For multi term queries, documents that contain more than one term might be ranked higher. In general, intersection should give higher quality hits than unions.

Testing

One of the biggest challenges in building robust distributed systems is testing and debugging. To help you get started, we are providing a basic test harness. It will also give you some idea of the kind of test infrastructure your code will be expected to run on, and the kind of tests it will be subjected to. Feel free to modify/enhance according to your needs.

The test script can be invoked as:

    
    $ ./run-test.sh config masterport queries.list
    
    
where,
  • config is a file containing the configuration of the system. Each line contains the hostname of the index server, the port it should be run on, and the directory is should serve. Fields are space or tab separated.
  • masterport is the port number for the master
  • queries.list is a file containing the search queries. Each line represents a single query. So multiple words on a line will be treated as a multi term query.

Sample config file


    1.group20.apking.usher.ucsdsys.net 8090 /home/diwaker/cse124/lab3/src/test/sample.search/part-1
    5.group20.apking.usher.ucsdsys.net 8090 /home/diwaker/cse124/lab3/src/test/sample.search/part-2
    3.group20.apking.usher.ucsdsys.net 8090 /home/diwaker/cse124/lab3/src/test/sample.search/part-3
    4.group20.apking.usher.ucsdsys.net 8090 /home/diwaker/cse124/lab3/src/test/sample.search/part-4
    
    

For the master port, just choose any unique number greater than 1024, and preferably greater than 5000. For example, 8080.

Sample queries.list


    the
    idiot
    the idiot
    

The test script will start the master and index servers in a new Screen session. If you are not familiar with screen, you are highly encouraged to learn more about it. It is an extremely valuable and versatile tool that you would surely find useful elsewhere.

You will find this setup useful for interactive debugging and monitoring of the system. Each server is started in a new window in the screen session. At any point during the experiment, you can "view" the experiment by running screen -r foo in another shell. The output on each window is automatically logged (look for screenlog.* files in the current directory).

The test script will start by issue each query found in queries.list. It will then start randomly killing the index servers (up to half of the servers will be terminated). After killing a server, the script will reissue all the queries again -- this should excercise the failure detection and replication mechanisms of the system.

IMPORTANT: If you are running/debugging without the script, make sure to set the LD_LIBRARY_PATH variable appropriately. In your shell, do:


    $ export LD_LIBRARY_PATH=/net/global/cse124/xmlrpc-c/lib:${LD_LIBRARY_PATH}
    
It is probably best to just include it in your .bashrc.

Downloads

  • lab3.tar.bz2: base source. It has been tested to compile on any of the VMs.
  • run-test.sh: place run-test.sh in the same directory as the source code.
  • screenrc: sample screenrc for use with run-test.sh. Feel free to modify. Should be placed in the same directory as run-test.sh.

Submission

For this project, you will submit your modified code (NO binaries, please!) as well a script to partition an inverted index, as required for your code to work correctly. You must also submit the set of index files you used for your testing, as well as what each index server was serving (the terms file).

Since your scheme for partitioning and replication is an important aspect of this assignment, we need to evaluate it beyond the textual description. We will also be testing your code on our own inverted index. For this to happen, we need some mechanism to take our inverted index, and partition/replicate it in a way that your code expects. You should provide us with some mechanism to do this.

Your script could be in any language -- C, C++, Java, Python, Perl, Ruby -- it doesn't matter, as long as it does the job. The script should take as input two parameters: a file name consisting of the inverted index, and the total number of index servers. Assume that the inverted index is already sorted lexicographically. The format of the index file is very simple. Each line contains a single word followed by the list of documents it appears in. All fields will be tab separated.

The script should output N directories, named part-1, part-2, ..., part-N. Each directory should contain two files: index and range, where the contents of the files should be as described earlier. We will simply take each directory and provide it as input to the corresponding index server (so server 1 will use part-1). Your script will be responsible for generating the files correctly.