CSE 124: Networked Services

Assignment 2: Hadoop


Introduction

Please read the Google MapReduce paper by Jeff Dean and Sanjay Ghemawat.

Here is a short introduction to Hadoop

The goal of this project is to become familiar with a popular model for programming distributed systems.

It is recommended to do this project in teams of two, but not required. Hadoop is a large system, and it can take some time to familiarize yourself with it.

Get started early! This system is hard to debug, and there are a lot of little tricks to it. Make good use of office hours.

Setup

You have all been provided with a set of virtual machines, named N.groupX.apking.usher.ucsdsys.net, where N is between 1 and 8 and X is your group number (you will receive an email telling you which group number has been assigned to you). These machines have publicly routable IP addresses, but incoming SSH connections from non-UCSD machines appear to be rejected, so you may have to use a UCSD proxy or otherwise log in from a UCSD IP.

The virtual machines all share /net/global/cse124. Several files there are of particular importance:

/net/global/cse124/input/classics This contains a lot of text files from Project Gutenberg. Useful as a small dataset on which to run simple text-processing.
/net/global/cse124/input/wikipedia This contains the full text of Wikipedia, one article per line. The article name runs from the beginning of the line to the first '%'; the article text follows (see the parsing sketch after this list). Redirect pages have not been stripped out (which is why there are many more articles in the file than are reported on the main Wikipedia page). Newlines from the wiki source have been converted to spaces, so some formatting has been lost. There is also a random subset of some 50,000 articles which may be useful for testing.
/net/global/cse124/default You'll probably want to copy this directory to your VM.
/net/global/cse124/default/bin/hadoop This is the main Hadoop script. It submits jobs to the cluster and lets you manipulate existing jobs. Run it without any arguments to print usage.
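
If it helps, here is a minimal sketch of parsing one line of the wikipedia file into a title and body, assuming the format described above; the class and method names are made up for illustration.

    // Hypothetical helper: splits one line of the wikipedia input file into
    // (article title, article text), using the first '%' as the separator.
    public class ArticleLine {
        public static String[] parse(String line) {
            int sep = line.indexOf('%');
            if (sep < 0) {
                // No separator: treat the whole line as a title with empty text.
                return new String[] { line.trim(), "" };
            }
            String title = line.substring(0, sep).trim();
            String text = line.substring(sep + 1);
            return new String[] { title, text };
        }

        public static void main(String[] args) {
            String[] parts = parse("Anarchism%Anarchism is a political philosophy ...");
            System.out.println(parts[0]);   // prints: Anarchism
        }
    }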

Steps

  1. Run /net/global/cse124/default/setup-hadoop.sh in your home directory on the VM 1.groupX.apking.usher.ucsdsys.net, where X is your group ID. This will set up your SSH keys, ensure that you have password-less access (required by Hadoop) to all the VMs in your cluster, and create the directories needed by Hadoop. If you encounter any errors in this step, please contact one of the TAs. DO NOT PROCEED FURTHER if this step fails.
  2. Run default/bin/hadoop namenode -format This will create a new distributed file system across all your nodes.
  3. Run default/bin/start-all.sh to start up all the nodes in your cluster (those taking part in the distributed file system and those taking part in the map reduce computations, which should be the same machines in this instance).
  4. Transfer data onto your distributed file system with default/bin/hadoop dfs -put local_file remote_file. You will likely want to do this for the files in /net/global/cse124/input. Be advised that this can take some time, especially with replication.
  5. Submit jobs.
    1. Put a .jar file containing your class onto a VM (usually 1.groupX.apking.usher.ucsdsys.net). Run bin/hadoop jar path/to/jar.jar path.to.class <args to class>. It is recommended that you try compiling and running the Trivial example first; the example is posted on the Writing a Map Reduce page. (A rough skeleton of what such a job looks like is sketched after this list.)
    2. If you have trouble with your jobs, you can check out the FAQ on the Debugging Page.
  6. Run default/bin/stop-all.sh to stop all the nodes in your cluster.
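
For orientation, here is a rough sketch of what a small job in the old org.apache.hadoop.mapred API looks like. This is not the posted Trivial example, just an illustrative word-count-style job; the class name and the assumption that input and output paths come from the command line are ours.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Illustrative word-count job in the style of the standard Hadoop tutorial.
    public class Trivial {

        // Mapper: emits (word, 1) for every whitespace-separated token in a line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer tok = new StringTokenizer(value.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Reducer: sums the counts for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext())
                    sum += values.next().get();
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(Trivial.class);
            conf.setJobName("trivial-wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

You would then submit it with something like bin/hadoop jar build.jar Trivial input_dir output_dir (substituting whatever jar and class names you actually use).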

Part 1

Build an Inverted Index.

Due: November 7th at 23:59

This small task will get you familiar with Hadoop and give you some basic experience with the interface before you dive into full Wikipedia clustering.

This is fairly straightforward. You will build an inverted index of words to the documents which contain them. Please do this on the /net/global/cse124/input/classics dataset. Your end result will be something of the form: (word, docid[]).

The big hiccup here is that the default file format doesn't provide the name of the file to your map function. You will have to figure out some way around this (such as putting the name of the file on each line of the file, or, much better, writing a new InputFormat class). This will get you a little more intimate with the workings of Hadoop.
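
As a rough illustration, here is what the inverted-index map and reduce might look like if you take the first workaround above and prepend the file name to each line, separated by a '|'; the separator, class names, and line format are assumptions for illustration only.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class InvertedIndex {

        // Mapper: each line is assumed to look like "filename|line of text";
        // emit (word, filename) for every token.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String line = value.toString();
                int sep = line.indexOf('|');
                if (sep < 0)
                    return;                        // malformed line, skip it
                Text doc = new Text(line.substring(0, sep));
                StringTokenizer tok = new StringTokenizer(line.substring(sep + 1));
                while (tok.hasMoreTokens()) {
                    // Real tokenization should also strip punctuation and case (see below).
                    output.collect(new Text(tok.nextToken()), doc);
                }
            }
        }

        // Reducer: collect the distinct documents containing each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterator<Text> values,
                               OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                Set<String> docs = new HashSet<String>();
                while (values.hasNext())
                    docs.add(values.next().toString());
                // Format the document list however your output format requires.
                output.collect(key, new Text(docs.toString()));
            }
        }
    }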

You'll also have to do a little more than simply tokenize the texts on whitespace. Make sure that punctuation and case also get stripped, but ensure that contractions don't change meaning (like "it's" becoming the possessive "its").
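
One possible normalization, sketched under the assumption that you only care about ASCII text: lowercase everything, turn every character that is not a letter or an apostrophe into a space, and then drop apostrophes that are not interior to a word. The method name is made up.

    // Hypothetical normalizer: lowercases, strips punctuation, but keeps interior
    // apostrophes so that "it's" does not collapse into "its".
    public static String normalize(String text) {
        String lowered = text.toLowerCase();
        // Replace everything that is not a letter or an apostrophe with a space.
        String stripped = lowered.replaceAll("[^a-z']+", " ");
        // Remove apostrophes at word boundaries (e.g. quotes), keep "it's", "don't".
        return stripped.replaceAll("(^|\\s)'+|'+(\\s|$)", " ").trim();
    }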

Turnin: a tarball containing your source code and the jar you used to submit map reduce jobs (build.jar if you use the Makefile). If your inverted index output is less than 1 MB, please include it in the tarball; otherwise include a location where it can be found on your DFS (and please ensure that it stays there). Email the tarball to the TAs with the subject formatted as [cse124:groupX:2.1], where groupX is your group number.

Part 2

Cluster Wikipedia Articles

Due: November 21st at 23:59

For this part you'll have to run a multi-pass map reduce; that is, output from one pass will become input to the next pass. You will be given a collection of files containing a median of 1000 Wikipedia articles apiece. Your task is to take all of these Wikipedia articles and cluster them, figuring out which articles are related to each other.
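
To make the multi-pass structure concrete, here is one possible shape for a driver that chains passes, with the output directory of each pass feeding the next; the path names are hypothetical, and you would set your own mapper and reducer classes on the JobConf as in the Trivial example.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver: runs several passes, each reading the previous pass's output.
    public class MultiPassDriver {
        public static void main(String[] args) throws Exception {
            int passes = Integer.parseInt(args[0]);   // how many passes to run
            for (int i = 0; i < passes; i++) {
                JobConf conf = new JobConf(MultiPassDriver.class);
                conf.setJobName("pass-" + i);
                // Pass 0 reads the pre-processed articles (hypothetical path);
                // later passes read the previous pass's output directory.
                String in = (i == 0) ? "articles" : ("pass-" + (i - 1));
                FileInputFormat.setInputPaths(conf, new Path(in));
                FileOutputFormat.setOutputPath(conf, new Path("pass-" + i));
                // conf.setMapperClass(...); conf.setReducerClass(...);   // your classes
                JobClient.runJob(conf);
            }
        }
    }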

For clustering, I'm going to suggest K-Means. It is one of the simplest clustering algorithms (there are many problems with it, but that's not the point of this assignment). K-Means is so called because it incrementally refines the data into k clusters around k means (centroids). Conceptually it's pretty simple: compare each item to each centroid, and whichever one it's closest to, that's the cluster to which it belongs. When the whole dataset has been partitioned this way, average the items in each cluster, and that mean becomes the new centroid for that cluster. Repeat, continually refining your centers.
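
In plain (non-Hadoop) Java, one refinement step looks roughly like the sketch below; the double[] representation and the Euclidean distance are placeholders, not a suggestion for how to represent articles. In your jobs, a natural split is to do the nearest-centroid assignment in map and the averaging in reduce.

    // Sketch of one K-Means refinement step over an in-memory dataset.
    // double[] stands in for however you represent an article; distance() is a placeholder.
    public class KMeansStep {

        static double distance(double[] a, double[] b) {
            // Placeholder: Euclidean distance. You will want something smarter (see below).
            double sum = 0;
            for (int i = 0; i < a.length; i++)
                sum += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(sum);
        }

        // Which centroid is this item closest to?
        static int nearest(double[] item, double[][] centroids) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (distance(item, centroids[c]) < distance(item, centroids[best]))
                    best = c;
            return best;
        }

        // One pass: assign every item to its nearest centroid, then average each cluster.
        static double[][] step(double[][] items, double[][] centroids) {
            int k = centroids.length, dim = items[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] item : items) {
                int c = nearest(item, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++)
                    sums[c][d] += item[d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++)
                        sums[c][d] /= counts[c];
                } else {
                    sums[c] = centroids[c];   // empty cluster: keep the old centroid
                }
            }
            return sums;
        }
    }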

You'll need to figure out a good distance function. The intuitive answer may be how many hops away one article is from another, but to compute that you need the whole graph, and with map reduce you only have access to a single article and some small amount of static data at any given time. What else is a good indication of how related two articles are?

When you feel you've refined enough (or you don't have any more time), you'll need to present your data somehow. Choose the most representative articles of each cluster and display them by distance from their centroid while making available their article title. Show each centroid (which may or may not be an article at this point) by displaying some appropriate subset of its links (such as the most popular).

Tasks
  • Pre-process article data into a usable form.
  • K-Means iteration. Repeat.
  • Presentation of results.