Project 2 - MapReduce

MapReduce allows for relatively fast and easy processing of very large datasets using a cluster of commodity machines. In this assignment we will become more familiar with the MapReduce paradigm and its open source Java implementation, Hadoop. This document will walk you through the process of setting up and executing a MapReduce task over the Netflix Prize dataset on a single node. Afterwards, you will be given a list of tasks to implement as your own MapReduce jobs and analyze.

Check-in due: Tuesday, February 17th, 2015
Project due: Tuesday, February 24th, 2015

Setup

Installing the VM

Rather than having you set up Hadoop on your own machines, we will be using the Cloudera QuickStart VM, which comes with Hadoop and several development tools, including Eclipse, pre-installed.

Setting up the VM

Download the starter code and dataset to your computer (available via the same link as above) and copy them into the VM by clicking and dragging (or by setting up a shared folder, if you'd like to code on your own machine but run in the VM).

Compiling and Running a job

Hadoop requires that MapReduce code is packaged in a jar file to be executed via the hadoop command. To create a jar file from the provided source files, execute the following commands:

$ javac -cp "/usr/lib/hadoop/client/*" <list of java files>
$ jar cvf example.jar <list of class files>

By default, Hadoop will search for files not on the computer's local storage, but in the Hadoop Distributed File System, or HDFS. hadoop fs will show you the list of commands you can perform on HDFS, including familiar commands such as hadoop fs -ls, -cp, and -mv. To copy files to and from HDFS, we use the -copyFromLocal and -copyToLocal commands, like so:

$ hadoop fs -copyFromLocal <local folder/file> <hdfs folder/file>
$ hadoop fs -copyToLocal <hdfs folder/file> <local folder/file>

We are then ready to run our Hadoop job. Hadoop takes as arguments the jar file we wish to run, the main class from within that jar file, and then arguments to that class.

$ hadoop jar example.jar <main class> <args>

A specific example of running a job can be found below.

Dataset

Before running any MapReduce tasks, we need to understand the formatting of the dataset and load it onto the image. The formatting of the Netflix Prize dataset is described in its accompanying README. The description is as follows:

The file "training_set.tar" is a tar of a directory containing 17770 files, one
per movie.  The first line of each file contains the movie id followed by a
colon.  Each subsequent line in the file corresponds to a rating from a customer
and its date in the following format:

CustomerID,Rating,Date

- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.
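
As a minimal sketch of how the format above might be read, the following plain-Java helper tells movie header lines apart from rating lines and splits a rating line into its three fields. The class and method names (RatingLineParser, isMovieHeader, parseMovieId, parseRating) are hypothetical; they are not part of the starter code or of Hadoop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the starter code or of Hadoop.
final class RatingLineParser {

    /** True for a movie header line such as "33:". */
    static boolean isMovieHeader(String line) {
        return line.trim().endsWith(":");
    }

    /** Extracts the movie id from a header line, e.g. "33:" gives 33. */
    static int parseMovieId(String header) {
        String trimmed = header.trim();
        return Integer.parseInt(trimmed.substring(0, trimmed.length() - 1));
    }

    /** Splits "CustomerID,Rating,Date" into its three fields. */
    static String[] parseRating(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException(
                    "Expected CustomerID,Rating,Date but got: " + line);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "33:", "1623180,5,2005-07-11", "282486,3,2005-07-12");
        int movieId = -1; // no header seen yet
        for (String line : sample) {
            if (isMovieHeader(line)) {
                movieId = parseMovieId(line);
            } else {
                String[] f = parseRating(line);
                System.out.println("movie=" + movieId + " customer=" + f[0]
                        + " rating=" + f[1] + " date=" + f[2]);
            }
        }
    }
}
```

Because the movie id appears only on the header line, code that processes one line at a time has to remember the most recent header if it needs to know which movie a rating belongs to.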

The customer ID will be used to identify unique ratings and the rating is the element of the dataset we will be aggregating via Map and Reduce tasks.

The chunk of the dataset that we will be using doesn't include all 17770 movies, as that would take far too long to run. We have provided you with a subset of the Netflix data to run on.

Assignment

For this assignment, you will be writing a series of MapReduce jobs that will calculate various metrics over the Netflix dataset. You will then have to answer questions about your results, and in some cases graph those results. Your answers and graphs will be compiled in a writeup which you will turn in along with your code.

0. Example: Average User Rating

We've already done this one for you! Check out the example. No need to turn anything in for this one, but it will help you with the later metrics.

1. Ratings per Date

Each rating in the dataset is accompanied by the date on which it was submitted. Your task is to write the map and reduce methods that count the number of ratings submitted on each date. The following snippet shows what the input and output files will look like:

Input:
...
33:
1623180,5,2005-07-11
282486,3,2005-07-12
1987434,4,2005-07-13
34:
1623180,2,2005-07-11
...

Output:

...
2005-07-11  2
2005-07-12  1
2005-07-13  1
...
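
To make the data flow concrete, here is a rough plain-Java simulation of the map, group, and reduce steps for this metric, run on the sample input above. It deliberately does not use the Hadoop API, so it only illustrates the idea; your actual job will need Hadoop's Mapper and Reducer classes as in the provided example. All names here (RatingsPerDateSketch, mapToDate, reduce, countByDate) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map and reduce; it does not use the Hadoop API.
final class RatingsPerDateSketch {

    /** "Map" step: returns the date key for a rating line, or null for a header line. */
    static String mapToDate(String line) {
        String[] fields = line.trim().split(",");
        return fields.length == 3 ? fields[2] : null;
    }

    /** "Reduce" step: sums the values collected for a single date. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, groups values by key (the "shuffle"), then reduces each group. */
    static Map<String, Integer> countByDate(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            String date = mapToDate(line);
            if (date == null) {
                continue; // movie header lines carry no rating
            }
            List<Integer> ones = grouped.get(date);
            if (ones == null) {
                ones = new ArrayList<Integer>();
                grouped.put(date, ones);
            }
            ones.add(1); // emit (date, 1)
        }
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "33:", "1623180,5,2005-07-11", "282486,3,2005-07-12",
                "1987434,4,2005-07-13", "34:", "1623180,2,2005-07-11");
        for (Map.Entry<String, Integer> e : countByDate(input).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

On the sample input, this yields a count of 2 for 2005-07-11 and 1 for each of the other two dates, matching the output snippet above.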

2. Ratings by Stars per Year

For each year, print out the number of 1-star, 2-star, 3-star, 4-star, and 5-star ratings.
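
One possible way to think about this metric, sketched in plain Java without the Hadoop API: take the year from the first four characters of the date and keep five counters per year, one for each star value. The names below (StarsPerYearSketch, countStarsByYear) and the printed layout are assumptions made for illustration, not a required format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration; names and output layout are assumptions, not requirements.
final class StarsPerYearSketch {

    /** Returns, for each year, an array whose index i holds the count of (i + 1)-star ratings. */
    static Map<String, int[]> countStarsByYear(List<String> lines) {
        Map<String, int[]> byYear = new TreeMap<String, int[]>();
        for (String line : lines) {
            String[] fields = line.trim().split(",");
            if (fields.length != 3) {
                continue; // skip movie header lines such as "33:"
            }
            int stars = Integer.parseInt(fields[1]); // ratings are integers from 1 to 5
            String year = fields[2].substring(0, 4); // dates are YYYY-MM-DD
            int[] counts = byYear.get(year);
            if (counts == null) {
                counts = new int[5];
                byYear.put(year, counts);
            }
            counts[stars - 1]++;
        }
        return byYear;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "33:", "1623180,5,2005-07-11", "282486,3,2005-07-12",
                "1987434,4,2005-07-13", "34:", "1623180,2,2005-07-11");
        for (Map.Entry<String, int[]> e : countStarsByYear(input).entrySet()) {
            System.out.println(e.getKey() + "\t" + Arrays.toString(e.getValue()));
        }
    }
}
```

In a real MapReduce job, the same idea could be expressed either by using the year as the key and tallying star values in the reducer, or by using a combined year-and-stars key and summing ones as in the previous metric.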

3. How active are users?

4. Polar Reviews

Deliverables

There are two turn-ins for this project: a check-in deadline, and a final turn-in.

Check-in

Check-in is due Tuesday, February 17th, 2015.

By this date, you must have chosen your partner and completed analysis of the first metric, "Ratings per Date".

Your turn-in should be a compressed tarball (.tar.gz) containing:

Final Turn-in

The final turn-in is due Tuesday, February 24th, 2015.

Your turn-in should be a compressed tarball (.tar.gz) containing:


Adapted from California Polytechnic State University's 2008 "Laboratory Assignment: MapReduce Paradigm in a VirtualGrid Environment"
Licensed under Creative Commons Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0