CSE 124: Networked Services

Assignment 2: Hadoop


Introduction

Please read the Google MapReduce paper by Jeff Dean and Sanjay Ghemawat.

Here is a short introduction to Hadoop.

The goal of this project is to become familiar with a popular model for programming distributed systems.

It is recommended to do this project in teams of two, but not required. Hadoop is a large system, and it can take some time to familiarize yourself with it.

Get started early! This system is hard to debug, and there are a lot of little tricks to it.

Part 0

In order to set up accounts on the cluster, all students must submit an RSA (or DSA) public ssh key and list the other member of their group. To generate a key pair, run the command: ssh-keygen -t rsa

For security, please set a passphrase on your private key. Because Hadoop requires password-less access, you should set up an ssh-agent or use a tool such as keychain to cache your passphrase.

Once you have generated your key pair, please submit your public key using the lab 2 registration form.

Local Setup

To begin developing on your local machine, please download and install the Hadoop package. You can begin familiarizing yourself with Hadoop by looking at the wiki documentation and reading our Map Reduce example.

Part 1

Build an Inverted Index.

Due: November 12th at 23:59

This small task will get you familiar with Hadoop and give you some basic experience with the interface before you dive into full Wikipedia clustering.

This is fairly straightforward. You will build an inverted index of words to the documents which contain them. To begin testing, use any plain text documents that you have available (or you may download some from sites such as Project Gutenberg). Your end result will be something of the form: (word, docid[]).

The one challenge with this is that the default input format does not provide you with the name of the file in your map function. You will have to figure out some way around this (such as writing a new InputFormat class). This will get you a little more intimate with the workings of Hadoop.
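If you would rather not write a new InputFormat, one common alternative is to pull the file name out of the current input split inside your map function. Below is a minimal sketch of that approach, assuming the older org.apache.hadoop.mapred API (the same JobConf-based API referenced in Part 2) and plain text input; token normalization is deliberately left out here and discussed next.

    // A minimal sketch: recover the source document name inside map() from the
    // current input split, instead of writing a custom InputFormat. Assumes the
    // older org.apache.hadoop.mapred API and plain text (TextInputFormat) input.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InvertedIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // For file-based input, the current split is a FileSplit that carries
        // the path of the file this record came from.
        FileSplit split = (FileSplit) reporter.getInputSplit();
        String docId = split.getPath().getName();

        // Naive whitespace tokenization; see the normalization note below.
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            output.collect(new Text(token), new Text(docId));
          }
        }
      }
    }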

You will also have to do a little more than simply tokenize the texts on whitespace. Make sure that punctuation and case also get stripped. However, take care that stripping punctuation does not change the meaning of contractions, for example turning "it's" into the possessive "its".
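One possible normalization pass, sketched below, lowercases each token and strips punctuation only from its ends, so internal apostrophes (and therefore contractions like "it's") survive. The exact rules are a design choice; this is just one reasonable starting point.

    // A possible token normalizer: lowercase, then trim punctuation from the
    // ends of the token while leaving internal apostrophes alone, so "it's"
    // does not collapse into "its". Adjust the rules to taste.
    public class Tokens {
      public static String normalize(String token) {
        String t = token.toLowerCase();
        // Strip non-alphanumeric characters from the ends only; apostrophes
        // are kept here so contractions keep their internal apostrophe.
        t = t.replaceAll("^[^a-z0-9']+|[^a-z0-9']+$", "");
        // Drop stray leading/trailing apostrophes left over from quoting.
        t = t.replaceAll("^'+|'+$", "");
        return t;
      }
    }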

Cluster

You have been provided with an account on the JobTracker machine of our cluster, 172.22.14.241. You can ssh to this machine using the login groupN, where N is your group number (you should receive an email as to which group you have been assigned). These machines do not have publicly routable IP addresses, so incoming ssh connections from non-UCSD machines will be rejected; you should either use a UCSD proxy or otherwise log in from a UCSD IP.

On the JobTracker machine, you will find a pre-configured Hadoop installation for each group under /n/cse124/hadoopN, where N is your group number. You should use the corresponding hadoop script, /n/cse124/hadoopN/bin/hadoop, to issue HDFS commands or submit jobs to the cluster. Each group has its own HDFS (which is the default path for hadoop fs commands). You do not need to (and should not attempt to) format the file system or run the Hadoop servers with the start-all.sh script; all the necessary Hadoop servers are already running. Please email the TA if you cannot ssh into the JobTracker machine or cannot run the bin/hadoop script from that machine.

For the first part of the assignment, you should build an inverted index for the files located in /n/cse124/input/classics. You will want to copy this directory into your HDFS directory so that you can use it for your indexer.

Each group has its own Hadoop cluster. The Hadoop JobTracker will schedule jobs FIFO as soon as machines become available. You can check the status of the cluster and view output logs by connecting to http://172.22.14.241:5003N, where N is your group number; as with ssh, this only works from hosts on the UCSD network.

Turnin: a tar file containing the jar file which you used to submit map reduce jobs (build.jar if you use the Makefile), source code, and a README file explaining how your indexer works. If your generated inverted index is less than 1MB, include the index in your tar submission; otherwise, include the path to your index in HDFS (and please ensure that it stays there). Use the lab 2.1 submission form.

Part 2

Cluster Wikipedia Articles

Due: December 4th at 23:59

For this part you'll have to run a multi-pass MapReduce. That is, output from one pass will become input to the next pass. /n/cse124/wikipedia/full contains the full text of Wikipedia. We have done some preliminary pre-processing for you, re-formatting the contents so that they appear one article per line. The article name runs from the beginning of the line to the first '%'; the article text follows. Redirect pages have not been stripped out (this is why there are many more articles in that file than are reported on the main Wikipedia page). Newlines from the wikisource have been converted to spaces, so some formatting has been lost. There are also random subsets of articles, which may be useful for testing, in /n/cse124/wikipedia/rand. The full Wikipedia collection contains approximately 5.65 million entries (including redirects).
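Since every article arrives as one line, the map functions in your passes will typically start by splitting the line at the first '%'. A minimal parsing sketch, again using the older mapred API (what you emit afterwards depends entirely on the pass):

    // Sketch of parsing one pre-processed Wikipedia record: the article title
    // runs from the start of the line to the first '%', and the text follows.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ArticleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable offset, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        int sep = line.indexOf('%');
        if (sep < 0) {
          return;  // malformed record; skip it
        }
        String title = line.substring(0, sep).trim();
        String body = line.substring(sep + 1);
        if (title.isEmpty()) {
          return;
        }
        // Emitting (title, text) is just an example; emit whatever the
        // current pass actually needs.
        output.collect(new Text(title), new Text(body));
      }
    }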

Your task is to take all of these wikipedia articles and cluster them, figuring out which articles are related to each other.

We suggest using K-Means for clustering. It is one of the simplest clustering algorithms (it has plenty of shortcomings, but that is not the point of this assignment). K-Means is so called because it iteratively partitions the data into k clusters around k means (centroids). Conceptually it is pretty simple: compare each item to each centroid, and assign the item to the cluster whose centroid is closest. When the whole dataset has been partitioned this way, average the items in each cluster, and that mean becomes the new centroid for the cluster. Repeat, continually refining your centroids.
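To make the loop concrete, here is a toy, purely in-memory sketch of K-Means over plain numeric vectors. It is not how you will implement the assignment (each step becomes a MapReduce pass, and articles are not numeric vectors), but it spells out the assignment and update steps you will be re-expressing:

    // Toy in-memory K-Means over numeric vectors, for illustration only.
    public class ToyKMeans {

      static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
      }

      // items: the data points; centroids: initial guesses, refined in place.
      static void cluster(double[][] items, double[][] centroids, int iterations) {
        int k = centroids.length;
        int dim = items[0].length;
        for (int iter = 0; iter < iterations; iter++) {
          double[][] sums = new double[k][dim];
          int[] counts = new int[k];
          // Assignment step: each item joins the cluster of its nearest centroid.
          for (double[] item : items) {
            int best = 0;
            for (int c = 1; c < k; c++) {
              if (distance(item, centroids[c]) < distance(item, centroids[best])) {
                best = c;
              }
            }
            counts[best]++;
            for (int d = 0; d < dim; d++) {
              sums[best][d] += item[d];
            }
          }
          // Update step: each centroid moves to the mean of its members.
          for (int c = 0; c < k; c++) {
            if (counts[c] == 0) continue;  // empty cluster: leave its centroid alone
            for (int d = 0; d < dim; d++) {
              centroids[c][d] = sums[c][d] / counts[c];
            }
          }
        }
      }
    }

In the MapReduce version, the assignment step is naturally a map task (emit the nearest centroid's id as the key) and the update step a reduce task (combine each cluster's members into a new centroid), with the centroids themselves stored in HDFS between passes as described in the Hints.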

Although this project is fairly open-ended, we will provide you with one approach that you may follow if you wish. For this approach, the pre-processing step involves building a linked-article index, where the keys are article names and the values are the articles to which they link. This requires understanding the wiki format, sanity-checking the input for bad article names and links, and extracting only the links that point to other articles.
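Link extraction is mostly pattern matching on the [[Target]] and [[Target|display text]] markup. A rough, regex-based sketch that ignores many corner cases of the wiki format (templates, images, interwiki links, and so on):

    // Rough wikitext link extractor: pulls targets out of [[Target]] and
    // [[Target|display text]] markup. Real wiki markup has many more corner
    // cases, so treat this only as a starting point for your pre-processing.
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkExtractor {
      private static final Pattern LINK =
          Pattern.compile("\\[\\[([^\\]|#]+)[^\\]]*\\]\\]");

      public static Set<String> extractLinks(String articleText) {
        Set<String> links = new HashSet<String>();
        Matcher m = LINK.matcher(articleText);
        while (m.find()) {
          String target = m.group(1).trim();
          // Skip empty targets and namespaced links such as File: or Category:.
          if (!target.isEmpty() && !target.contains(":")) {
            links.add(target);
          }
        }
        return links;
      }
    }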

You will need to figure out a good distance function. The intuitive answer may be how many hops away one article is from another, but to compute that you need the whole graph. With MapReduce, you only have access to a single article and some small amount of static data at any given time. Think about other metrics that capture how related two articles are.
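One metric that needs only the two articles' own data is the overlap of their outgoing link sets, for example Jaccard similarity. A minimal sketch (how you weight or prune links is up to you):

    // One candidate relatedness metric: Jaccard similarity of two articles'
    // outgoing-link sets, |A intersect B| / |A union B|. It needs only the two
    // link sets themselves, which fits the constraint of not having the whole
    // graph available inside a map task.
    import java.util.HashSet;
    import java.util.Set;

    public class LinkSimilarity {
      public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
          return 0.0;
        }
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
      }
    }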

When you feel that you have refined enough (or you do not have any more time), you will need to present your data somehow. Choose the most representative articles of each cluster and display them, ordered by distance from their centroid, along with their article titles. Show each centroid (which may or may not be an article at this point) by displaying some appropriate subset of its links (such as the most popular).

Tasks
  • Pre-process article data into a usable form.
  • Initialize centroids/clusters.
  • K-Means iteration. Repeat.
  • Presentation of results.
Hints

You need to pick a reasonable value for k, the number of clusters. Different values of k can greatly affect your results. You might start with a fixed size per cluster, say between 100 and 1000 articles per cluster, and then compute how many clusters you need based on the total number of articles (for example, at roughly 500 articles per cluster, the full collection of about 5.65 million entries works out to roughly 11,000 clusters). If you have time, it is recommended that you see how your results vary with different values of k.

You can store a small amount of state in HDFS that you can reference/update in your Map and Reduce tasks. This might be a convenient way to store your centroids. If you find them useful, we are providing you with partial implementations of Centroid and Clusters classes. You may use these directly, copy code from them, or not use them at all. Each Centroid maintains an array of links, which represents its feature set. The Clusters class maintains the set of all centroids, and can serialize and deserialize the centroids to HDFS. Make sure that you set the parameters "input.centroids.path" and "output.centroids.path" in your JobConf to directories in HDFS, as the Clusters object uses these parameters to know where to load and save its state. The Clusters class also contains a main() method that allows you to convert a serialized representation of the centroids to plain text for debugging/analysis.
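For example, a job driver might point the Clusters class at per-iteration centroid directories roughly like this (the directory names are placeholders, not required paths):

    // Pointing the provided Clusters class at its HDFS state via the JobConf.
    // The /user/groupN/centroids/... paths are placeholders; use your own.
    import org.apache.hadoop.mapred.JobConf;

    public class KMeansJobSetup {
      public static JobConf configureCentroidPaths(JobConf conf, int iteration) {
        conf.set("input.centroids.path", "/user/groupN/centroids/iter" + iteration);
        conf.set("output.centroids.path", "/user/groupN/centroids/iter" + (iteration + 1));
        return conf;
      }
    }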

Extra Credit

Extra credit will be given for better clustering results. Possible strategies to achieve this include using a more sophisticated method to seed the initial clusters/centroids, using multiple metrics for computing centroid distance, refining clusters using other metrics, and building tools for analyzing/presenting the resulting clusters to determine the effectiveness of the clustering.

Turnin: a tar file containing the jar file which you used to submit map reduce jobs; source code; a description of your approach, including how you pre-processed your data, how you initialized your clusters, your distance function, your map and reduce tasks, any special features you implemented for improved results (ideally with examples illustrating how they helped), etc; and a selection of approximately 100 representative clusters (or subsets).

Lab 2 submission form

In addition to submitting your code and results, each group will meet with the TA to present their implementation and results.