CSE 130 - Programming Assignment #6

MapReduce and Python

120 points

Must be turned in no later than 11:59:59 PM on 03/10/2008
(see submission instructions below)

(click your browser's refresh button to ensure that you have the most recent version)

Note about Java: While you can any system that has Java in order to implement your MapReduce solution, the code you turn in must run correctly on the ACS Linux environment.

Note about Python: To download and install Python version 2.5 on your home machines see this page. Remember that this is only to enable you to play with the assignment at home: The final version turned in must work on the ACS Linux machines. While you can use Windows to begin working with Python, the code you turn in must run in the ACS Linux environment.


Integrity of Scholarship

University rules on integrity of scholarship will be strictly enforced. By completing this assignment, you implicitly agree to abide by the UCSD Policy on Integrity of Scholarship described beginning on page 68 of the Academic Regulations section (PDF) of the 2007-2008 General Catalog, in particular, "all academic work will be done by the student to whom it is assigned, without unauthorized aid of any kind."

You are expected to do your own work on this assignment; there are no group projects in this course. You may (and are encouraged to) engage in general discussions with your classmates regarding the assignment, but specific details of a solution, including the solution itself, must always be your own work. Incidents that violate the University's rules on integrity of scholarship will be taken seriously: In addition to receiving a zero (0) on the assignment, students may also face other penalties, up to and including, expulsion from the University. Should you have any doubt about the moral and/or ethical implications of an activity associated with the completion of this assignment, please see the instructors.


Code Documentation and General Requirements

Code for all programming assignments should be well documented. A working program with no comments will receive only partial credit. Documentation entails writing a description of each function/method, class/structure, as well as comments throughout the code to explain the program logic. It is understood that some of the exercises in this programming assignment require extremely little code and will not require extensive comments.

While few programming assignments pretend to mimic the "real" world, they may, nevertheless, contain some of the ambiguity that exists outside the classroom. If, for example, an assignment is amenable to differing interpretations, such that more than one algorithm may implement a correct solution to the assignment, it is incumbent upon the programmer to document not only the functionality of the algorithm (and more broadly his/her interpretation of the program requirements), but to articulate clearly the reasoning behind a particular choice of solution.


Assignment Overview

The objective of this assignment is to introduce you to MapReduce and some more advanced features of Python. All the files needed for this assignment can be found in the following zip file.

MapReduce Part: The MapReduce part of this assignment is spread over a variety of java files in the provided zip file. These java files contain a simple implementation of MapReduce, and some applications that use MapReduce. Some of the java files have methods that throw RuntimeException("Implement me!"). Your task in problem 1 is to implement these methods.

Python Part:The Python part of this assignment is spread over five python files that are available in the provided zip file: misc.py, vector.py, decorators.py, test.py, and the sample output file decorators.out. These files contain several skeleton Python functions and classes, with missing bodies or missing definitions, which currently contain the text raise Failure("to be written") or are present only as comments. Your task in problems 2 and 3 is to replace the text in those files with the the appropriate Python code for each of those expressions. An emphasis should be placed on writing concise easy to read code.


Assignment Testing and Evaluation

Your functions/programs must compile and/or run on a Linux ACS machine (e.g. ieng6.ucsd.edu), as this is where the verification of your solutions will occur. While you may develop your code on any system, ensure that your code runs as expected on an ACS machine prior to submission. You should test your code in the directories from which the zip files (see below) will be created, as this will approximate the environment used for grading the assignment.

Most of the points, except those for comments and style, will be awarded automatically, by evaluating your functions against a given test suite. The file, test.py contains a very small suite of tests which gives you a flavor of these tests. At any stage, by typing at the UNIX shell :

python < test.py | grep "130>>" > log
you will get a report on how your code stacks up against the simple tests.

The last (or near the bottom) line of the file log must contain the word "Compiled" otherwise you get a zero for the whole assignment. If for some problem, you cannot get the code to compile, leave it as is with the raise ..., with your partial solution enclosed below as a comment. There will be no exceptions to this rule. The second last line of the log file will contain your overall score, and the other lines will give you a readout for each test. You are encouraged to try to understand the code in test.py, and subsequently devise your own tests and add them to test.py, but you will not be graded on this.

Alternately, inside the Python shell, type (user input is in red):

>>> import test
.
.
.
130>> Results: ...
130>> Compiled
and it should print a pair of integers, reflecting your score and the max possible score on the sample tests. If instead an error message appears, your code will receive a zero.

Your java code also needs to compile with no errors. In particular, running javac *.java in your solution directory should not output any errors. If there are any compilation errors, you will not receive any credit.


Submission Instructions

1. Create the zip file for submission

Your solutions to this assignment will be stored in separate files under a directory called solution/, inside which you will place all the files required for this project (including makefiles, etc.). There should be no other files in the directory.

After creating and populating the directory as described above, create a zip file called <LastName>_<FirstName>_pa6.zip by going into the directory solution and executing the UNIX shell command:

zip <LastName>_<FirstName>_pa6.zip *

2. Submit the zip file via the turnin program

Once you've created the zip file with your solutions, you will use the turnin program to submit this file for grading by going into the directory solution/ and executing the UNIX shell command:

turnin -c cs130w -p pa6 <LastName>_<FirstName>_pa6.zip

The turnin program will provide you with a confirmation of the submission process; make sure that the size of the file indicated by turnin matches the size of your tar file. See the ACS Web page on turnin for more information on the operation of the program.


Problem #1: MapReduce

The goal of this problem is to familiarize you with the programming model behind Google's MapReduce framework. As we discussed in class, MapReduce is a programming paradigm inspired from functional programming that Google has developed in order to express their own internal computations over large data sets. Google's implementation of MapReduce is not available publicly, but Hadoop is an open-source publicly available version of MapReduce. It would be ideal to use Hadoop for this assignment, but Hadoop is a real system, and as with any real system, it takes time to get familiar with it, more time than one can spend on a single programming assignment for our class.

As a result, for you to get a flavor of what it is like to program in MapReduce, we have implemented our own simple version of MapReduce, called MiniMapReduce, that runs on a single computer. Although MiniMapReduce does not distribute computations across many computers, it will allow you to program in the MapReduce model and get familiar with it. Once you understand the model, your knowledge and experience will easily translate into more realistic MapReduce frameworks like Hadoop.

Good style matters. In addition to making your code clear and well-documented, make sure that the javac compiler never gives you this warning:

Note: Blah.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

This warning indicates that you are not using parameterized types properly. In particular, you are probably forgetting a type parameter somewhere, for example declaring a variable of type List rather than List<String>. If you get this warning, recompile using the specified flag, and the compiler will point out the errors that you need to fix.

(a) 0 points

To start, you need to get familiar with the MapReduce programming model. Read the following paper on MapReduce, written by its inventors at Google (you will need to access it from a UCSD URL in order to download the PDF for it). You only really need to read Sections 1 and 2, but if you are interested you can also read the rest. (You might also browse through the lecture notes on MapReduce from class, which are posted on the course web page.)

Now unzip the provided zip file, which contains MiniMapReduce (along with some Python files for Problems 2 and 3).

(b) 0 points

Java doesn't have first-class functions, as in OCaml, but they can be simulated with objects. Take a look at the Mapper interface, which intuitively represents one of the user-defined functions to pass to MapReduce. The interface has a single method named apply representing this function; note that it has the same type as specified for the user-defined map function in Section 2.2 of the MapReduce paper. So m.apply(...), where m is a user-defined implementation of Mapper, can be used to invoke the user's desired mapping function. Similarly, our Reducer interface has a single method representing the user-defined reduce function. The interface Map.Entry in Java's standard library represents a (key,value) pair. Our MREntry class implements this interface.

Now look at the file WordCount.java, which gives you an implementation of the word count example that was covered in class and that was explained in the MapReduce paper. The WordCount class counts the number of times that words occur in a set of files that are passed in on the command line. The output is written to a file wordcount.out. To compile the code, type javac WordCount.java, and to run it, type java WordCount http*. This will count the words of a bunch of test html files we have provided, which come from www.cnn.com, and put the results in the file wordcount.out.

All that WordCount does is create an implementation of Mapper, create an implementation of Reducer, and then use these to create an instance of MiniMapReduce that solves the problem. The constructor of MiniMapReduce takes the user's Mapper and Reducer implementation. MiniMapReduce also has a function mapReduce that takes an initial list of (key,value) pairs and performs the mapReduce computation using that Mapper and Reducer. Our FileUtil class has helper methods for reading a file's contents into a String and for writing the results of a MapReduce computation to a file.

The user-defined implementations of Mapper and Reducer use Java's ability to create anonymous classes. For example, the first line of main in WordCount.java creates a new (unnamed) class that meets the Mapper interface, provides an implementation for apply, and creates an instance of this class, all in one declaration. A Java equivalent of anonymous functions like (fn x => ...) in OCaml!

(c) 30 points

Complete the implementation of the URL count example that is found in URLCount.java. The URL counter example should count the number of times that each URL occurs in the files that are passed in on the command line. It is very similar to word count, except that you are not counting words, but rather URLs. You simply need to fill in the implementation of the two apply functions in URLCount.java, one for the map implementation, and one for the reduce implementation. Currently they each just throw an exception.

For simplicity, you can assume that a URL is always between double quotes, and always starts with http://. Check out Java's StringTokenizer class, which can help you isolate these URLs. Once your code is working and compiled, you can type java URLCount http* to test your URL counter on the html files we have provided you. This produces a file urlcount.out with the results. To help you figure out if your code is working right, we have included a file named urlcount-soln.out, which contains the output of running our solution on the provided html files. You should also test your code on other html files to make sure it works.

(d) 30 points

Complete the implementation of the reverse web link example that is found in ReverseWebLinks.java. Given a set of web pages, ReverseWebLinks should compute, for each URL linked from at least one of those pages, the list of pages that directly link to that URL. The end result should be a list of pairs (a,b), where a is a URL and b is the list of pages (also represented as URLs) that directly link to a. If a single web page contains inside of it multiple links to a, then this web page should appear multiple times in the list b, once for each link to a. The set of web pages to consider is passed on the command line as a set of file names. Each file name represents a web page. The html content of the web page can be found by just reading the file from the file system, as you did in the previous problem. The URL of the web page is encoded directly in the filename. For example, the file http%--www.cnn.com-US- contains the contents of the web site http://www.cnn.com/US/. We use - instead of / and % instead of : because file names cannot contain slashes or colons.

Once your code is working and compiled, you can type java ReverseWebLinks http* to test your code on the html files we have provided you. This produces a file reverse.out with the results. To help you figure out if your code is working right, we have included a file named reverse-soln.out, which contains the output of running our solution on the provided html files. You should also test your code on other html files to make sure it works.


Problem #2: Vector class (vector.py)

The Vector class that you will implement will be a fixed length vector which implements a variety of operations. Start with the skeleton code provided in vector.py contained in the provided zip file.

(a) 10 points

Add a constructor to the Vector class. The constructor should take a single argument. If this argument is either an int or a long or an instance of a class derived from one of these, then consider this argument to be the length of the Vector. In this case, construct a Vector of the specified length with each element is initialized to 0.0. If the length is negative, raise a ValueError with an appropriate message. If the argument is not considered to be the length, then if the argument is a sequence (such as a list), then initialize with vector with the length and values of the given sequence. If the argument is not used as the length of the vector and if it is not a sequence, then raise a TypeError with an appropriate message.

Next implement the __repr__ method to return a string of python code which could be used to initialize the Vector. This string of code should consist of the name of the class followed by an open parenthesis followed by the contents of the vector represented as a list followed by a close parenthisis. Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> Vector(3)
Vector([0.0, 0.0, 0.0])
>>> Vector(3L)
Vector([0.0, 0.0, 0.0])
>>> Vector([4.5, "foo", 0])
Vector([4.5, 'foo', 0])
>>> Vector(0)
Vector([])
>>> Vector(-4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "vector.py", line 14, in __init__
    raise ValueError("Vector length cannot be negative")
ValueError: Vector length cannot be negative

(b) 10 points

Implement the functions __len__ and __iter__ in Vector. The function __len__ should return the length of the Vector. The function __iter__ should return an object that can iterate over the elements of the Vector. This is most easily done using yield(). See here for more information.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> [x*2 for x in Vector([3,3.25,"foo"])]
[6, 6.5, 'foofoo']
>>> len(Vector(23))
23

(c) 10 points

Implement the + and += operators for Vector. The other argument to + and the second argument to += can be any sequence of the same length as the vector. All of these should implement component-wise addition. See here for more information on how to implement these operators. Other arithmetic operators are implemented in a similar way, however it is not required for this assignment.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> Vector([6,8,2])+Vector([4,-3,2])
Vector([10, 5, 4])
>>> Vector([6,8,2])+[4,-3,2]
Vector([10, 5, 4])
>>> (6,8,2)+Vector([4,-3,2])
Vector([10, 5, 4])
>>> v=Vector(["f","b"])
>>> v+=("oo","oo")
>>> v
Vector(['foo', 'boo'])


Problem #3: Decorators (decorators.py)

In the file decorators.py (contained in the provided zip file) there is an example decorator, profiled and stub code for decorators that you will be writing. At the bottom of the file are many examples of decorated functions. The expected output for these functions is available in the file decorators.out (also contained in the provided zip file).

(a) 30 points

Complete the definition for the decorator traced. When the decorated function is called, the decorator should print out an ASCII art tree of the recursive calls and their return values. The format of the tree should be as follows:
  1. Print a pipe symbol followed by a space ("| ") for every level of nested function calls.
  2. Print a comma then a minus sign then a space (",- ") next.
  3. Print the name of the function being traced followed by an open parenthesis followed by the repr() of all of the arguments. Arguments should be seperated by a comma followed by a space (", "). After the normal arguments, print all the keyword arguments in the form keyword then equals sign then repr() of the value of the keyword argument. The keyword arguments should also be seperated by a comma followed by a space. Keyword arguments should be printed in the order returnd by dict.items().
  4. Next increase the nesting level and call the function itself.
  5. At the original nesting level, print a pipe symbol followed by a space ("| ") for every level of nested function calls.
  6. Print a backquote then a minus sign then a space("`- ").
  7. Finally, print the repr() of the return value.
The return value of the function should be return to the caller after all printing is complete. If an exception occurs in the funciton, the nesting level must be adjusted to the appropriate level where the exception is caught. See change for an example.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from decorators import *
>>> @traced
>>> def foo(a,b):
...    if a==0: return b
...    return foo(b=a-1,a=b-1)
>>> foo(4,5)

,- foo(4, 5)
| ,- foo(a=4, b=3)
| | ,- foo(a=2, b=3)
| | | ,- foo(a=2, b=1)
| | | | ,- foo(a=0, b=1)
| | | | `- 1
| | | `- 1
| | `- 1
| `- 1
`- 1
1

(b) EXTRA CREDIT: 30 points

Complete the definition of the memoized decorator. When the decorated function is called, the decorator should check to see if the function has already been called with the given arguments. If so, the decorator should return the value the the function returned when it was last called with the given arguments. If the function last threw an exception when called with the given arguments, the same exception should be thrown again. If the function has not been called with the given arguments, then call it and record the return value or exception. Then return the return value or raise the thrown exception.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from decorators import *
>>> from time import sleep
>>> @memoized
>>> def foo(a):
...    sleep(a)
...    return a
>>> foo(5)
# 5 second pause
5
>>> foo(5))
# practically instantaneous
5