CSE 130 - Programming Assignment #6


240 points

Must be turned in no later than 4:59:59 PM on 11/19/2010
(see submission instructions below)

(click your browser's refresh button to ensure that you have the most recent version)

Note: To download and install Python version 3.1.2 on your home machines see this page. Remember that this is only to enable you to play with the assignment at home: The final version turned in must work on the ACS Linux machines. While you can use MacOS or Windows to begin working with Python, the code you turn in must be that required for the Linux environment.

Integrity of Scholarship

University rules on integrity of scholarship will be strictly enforced. By completing this assignment, you implicitly agree to abide by the UCSD Policy on Integrity of Scholarship described beginning on page 68 of the Academic Regulations section (PDF) of the 2007-2008 General Catalog, in particular, "all academic work will be done by the student to whom it is assigned, without unauthorized aid of any kind."

You are expected to do your own work on this assignment; there are no group projects in this course. You may (and are encouraged to) engage in general discussions with your classmates regarding the assignment, but specific details of a solution, including the solution itself, must always be your own work. Incidents that violate the University's rules on integrity of scholarship will be taken seriously: In addition to receiving a zero (0) on the assignment, students may also face other penalties, up to and including, expulsion from the University. Should you have any doubt about the moral and/or ethical implications of an activity associated with the completion of this assignment, please see the instructors.

Code Documentation and General Requirements

Code for all programming assignments should be well documented. A working program with no comments will receive only partial credit. Documentation entails providing documentation strings for all methods, classes, packages, etc., and comments throughout the code to explain the program logic. Comments in Python are preceded by # and extend to the end of the line. Documentation strings are strings in the first line of a function, method, etc., and are accessible using help(foo), where foo is the name of the method, class, etc. It is understood that some of the exercises in this programming assignment require extremely little code and will not require extensive comments.

While few programming assignments pretend to mimic the "real" world, they may, nevertheless, contain some of the ambiguity that exists outside the classroom. If, for example, an assignment is amenable to differing interpretations, such that more than one algorithm may implement a correct solution to the assignment, it is incumbent upon the programmer to document not only the functionality of the algorithm (and more broadly his/her interpretation of the program requirements), but to articulate clearly the reasoning behind a particular choice of solution.

Assignment Overview

The objective of this assignment is to introduce you to some more advanced features of Python. The first part of this assignment will cover topics from classes and OOP to higher order functions and the decorator pattern.

The second part will familiarize you with the programming model behind Google's MapReduce framework. As we discussed in class, MapReduce is a programming paradigm inspired from functional programming that Google has developed in order to express their own internal computations over large data sets. Google's implementation of MapReduce is not available publicly, but Hadoop is an open-source version publicly available version of MapReduce written in Java.

See the beginnings of the separate parts for the files you need to download.

Assignment Testing and Evaluation

Your functions/programs must compile and/or run on a Linux ACS machine (e.g. ieng6.ucsd.edu), as this is where the verification of your solutions will occur. While you may develop your code on any system, ensure that your code runs as expected on an ACS machine prior to submission. You should test your code in the directories from which the zip files (see below) will be created, as this will approximate the environment used for grading the assignment.

Most of the points, except those for comments and style, will be awarded automatically, by evaluating your functions against a given test suite. The file, test.py contains a very small suite of tests which gives you a flavor of these tests. At any stage, by typing at the UNIX shell :

python < test.py | grep "130>>" > log
you will get a report on how your code stacks up against the simple tests.

The last (or near the bottom) line of the file log must contain the word "Compiled" otherwise you get a zero for the whole assignment. If for some problem, you cannot get the code to compile, leave it as is with the raise ..., with your partial solution enclosed below as a comment. There will be no exceptions to this rule. The second last line of the log file will contain your overall score, and the other lines will give you a readout for each test. You are encouraged to try to understand the code in test.py, and subsequently devise your own tests and add them to test.py, but you will not be graded on this.

Alternately, inside the Python shell, type (user input is in red):

>>> import test
130>> Results: ...
130>> Compiled
and it should print a pair of integers, reflecting your score and the max possible score on the sample tests. If instead an error message appears, your code will receive a zero.

Submission Instructions

1. Create the zip file for submission

Your solutions to this assignment will be stored in separate files under a directory called pa6_solution/, inside which you will place the files: vector.py, decorators.py, ff.py, sg.py . These four files listed are the versions of the corresponding supplied files that you will have modified. There should be no other files in the directory.

After creating and populating the directory as described above, create a zip file called <LastName>_<FirstName>_pa6.zip by going into the directory pa6_solution and executing the UNIX shell command:

zip <LastName>_<FirstName>_pa6.zip *

2. Submit the zip file via the turnin program

Once you've created the zip file with your solutions, you will use the turnin program to submit this file for grading by going into the directory pa3_solution/ and executing the UNIX shell command:

turnin -c cs130f -p pa6 <LastName>_<FirstName>_pa6.zip

The turnin program will provide you with a confirmation of the submission process; make sure that the size of the file indicated by turnin matches the size of your zip file. See the ACS Web page on turnin for more information on the operation of the program.


May of the questions in this assignment use more advanced features of Python. You may find the following links useful.

Useful Links

Defining Classes

All classes used in this assignment should be new-style Python classes which have either object as a base class or only other new-style classes as base classes.
class foo(object):

class bar(foo):

Quirks / Features of Python

Part 1

This part of the assignment is spread over three python files vector.py, decorators.py, and test.py, the text files, and the sample output file decorators.out that you need to download. These files contain several skeleton Python functions and classes, with missing bodies or missing definitions, which currently contain the text raise Failure("to be written") or are present only as comments. Your task is to replace the text in those files with the the appropriate Python code for each of those expressions. An emphasis should be placed on writing concise easy to read code.

Problem #0: Documentation

None of the functions in this assignment are documented (except the Failure exception). You are expected to document all modules (.py files), all classes, and all public functions with doc strings. Doc strings should describe the behavior of the function/class/module. For example:
"The function prod takes two integers and returns their product".
If the implementation is not straightforward and obvious, there should be comments. For example:
# prod is implemented using a FFT to get O(n log n) time
Once documented you should get the following behavior at the Python prompt:

>>> import vector
>>> import decorators
>>> help(vector)
Screen full of documentation with all your doc strings
>>> help(decorators)
Screen full of documentation with all your doc strings

Problem #1: Vector class (vector.py)

The Vector class that you will implement will be a fixed length vector which implements a variety of operations.

(a) 10 points

Add a constructor to the Vector class. The constructor should take a single argument. If this argument is either an integer or an instance of a class derived from integer, then consider this argument to be the length of the Vector. In this case, construct a Vector of the specified length with each element is initialized to 0.0. If the length is negative, raise a ValueError with an appropriate message. If the argument is not considered to be the length, then if the argument is a sequence (such as a list), then initialize with vector with the length and values of the given sequence. If the argument is not used as the length of the vector and if it is not a sequence, then raise a TypeError with an appropriate message.

Next implement the __repr__ method to return a string of python code which could be used to initialize the Vector. This string of code should consist of the name of the class followed by an open parenthesis followed by the contents of the vector represented as a list followed by a close parenthisis. Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> Vector(3)
Vector([0.0, 0.0, 0.0])
>>> Vector([4.5, "foo", 0])
Vector([4.5, 'foo', 0])
>>> Vector(0)
>>> Vector(-4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "vector.py", line 14, in __init__
    raise ValueError("Vector length cannot be negative")
ValueError: Vector length cannot be negative

(b) 10 points

Implement the functions __len__ and __iter__ in Vector. The function __len__ should return the length of the Vector. The function __iter__ should return an object that can iterate over the elements of the Vector. This is most easily done using yield(). See here for more information.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> [x*2 for x in Vector([3,3.25,"foo"])]
[6, 6.5, 'foofoo']
>>> len(Vector(23))

(c) 10 points

Implement the + and += operators for Vector. The other argument to + and the second argument to += can be any sequence of the same length as the vector. All of these should implement component-wise addition. See here for more information on how to implement these operators. Other arithmetic operators are implemented in a similar way, however it is not required for this assignment.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> Vector([6,8,2])+Vector([4,-3,2])
Vector([10, 5, 4])
>>> Vector([6,8,2])+[4,-3,2]
Vector([10, 5, 4])
>>> (6,8,2)+Vector([4,-3,2])
Vector([10, 5, 4])
>>> v=Vector(["f","b"])
>>> v+=("oo","oo")
>>> v
Vector(['foo', 'boo'])

(d) 10 points

Add the method dot which takes either a Vector or a sequence and returns the dot product of the argument with current Vector instance. The dot product is defined as the sum of the component-wise products. The behavior of this function if any elements are not numeric is undefined.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> Vector([6,8,2]).dot(Vector([4,-3,2]))
>>> Vector([6,8,2]).dot([4,-3,2])

(e) 10 points

Implement the __getitem__ and __setitem__ methods to allow element level access to the Vector. Indexing should be 0 based (as in C). If the index is negative, it should translate to the length of the Vector plus the index. Thus, index -1 is the last element. If the index is out of range, your implementation should raise an IndexError with an appropriate message. This behavior should be identical to that of a list. These methods should preserve the length of the Vector.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> v=Vector(7)
>>> v[4]
>>> v[4]="foo"
>>> v[4]
>>> v
Vector([0.0, 0.0, 0.0, 0.0, 'foo', 0.0, 0.0])

(f) 10 points

Extend your implementation of __getitem__ and __setitem__ methods to allow slice level access to the Vector. These methods should preserve the length of the Vector. If an assignment to a slice would change the length of the vector, raise a ValueError exception. The semantics otherwise should mimic those of list.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> v=Vector(7)
>>> v[2:4]=[4,4]
>>> v
Vector([0.0, 0.0, 4, 4, 0.0, 0.0, 0.0])
>>> v[6:2:-3]=[-1,-2]
>>> v
Vector([0.0, 0.0, 4, -2, 0.0, 0.0, -1])

(g) 10 points

Implement comparison functions for Vectors. Two vectors should be considered equal if each element in the first Vector is equal to the respective element in the second Vector. A Vector, a, should be considered greater than a Vector, b, if the largest element of a is greater than the largest element of b. If the largest elements of both are equal, then compare the second-largest elements, and so forth. If every pair compared in this fashion is equal, then a should not be considered greater than b, but a should be considered greater than or equal to b. If a is greater than b, then b is less than a. This is a nonstandard method for comparing vectors, and for a pair of vectors v and w, v>=w does not imply that v>w or v==w. When a Vector is compared to something that isn't a Vector, they should never be equal, and the result should be the same as if the respective comparison function were called in in the class object.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from vector import *
>>> a=Vector([1,3,5])
>>> b=Vector([5,1,3])
>>> c=Vector([4,5,4])
>>> a<b
>>> a>b
>>> a>=b
>>> a>c
>>> a<c
>>> a>=c
>>> a==c
>>> a==a
>>> a!=c
>>> a!=[1,3,5]

Problem #2: Decorators (decorators.py)

Here are some links on *args and **args tutorial, python manual. In the file decorators.py there is an example decorator, profiled and stub code for decorators that you will be writing. At the bottom of the file are many examples of decorated functions. The expected output for these functions is available here: decorators.out.

(a) 30 points

Complete the definition for the decorator traced. When the decorated function is called, the decorator should print out an ASCII art tree of the recursive calls and their return values. The format of the tree should be as follows:
  1. Print a pipe symbol followed by a space ("| ") for every level of nested function calls.
  2. Print a comma then a minus sign then a space (",- ") next.
  3. Print the name of the function being traced followed by an open parenthesis followed by the repr() of all of the arguments. Arguments should be seperated by a comma followed by a space (", "). After the normal arguments, print all the keyword arguments in the form keyword then equals sign then repr() of the value of the keyword argument. The keyword arguments should also be seperated by a comma followed by a space. Keyword arguments should be printed in the order returnd by dict.items().
  4. Next increase the nesting level and call the function itself.
  5. At the original nesting level, print a pipe symbol followed by a space ("| ") for every level of nested function calls.
  6. Print a backquote then a minus sign then a space("`- ").
  7. Finally, print the repr() of the return value.
The return value of the function should be return to the caller after all printing is complete. If an exception occurs in the funciton, the nesting level must be adjusted to the appropriate level where the exception is caught. See change for an example.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from decorators import *
>>> @traced
>>> def foo(a,b):
...    if a==0: return b
...    return foo(b=a-1,a=b-1)
>>> foo(4,5)

,- foo(4, 5)
| ,- foo(a=4, b=3)
| | ,- foo(a=2, b=3)
| | | ,- foo(a=2, b=1)
| | | | ,- foo(a=0, b=1)
| | | | `- 1
| | | `- 1
| | `- 1
| `- 1
`- 1

(b) 30 points

Complete the definition of the memoized decorator. When the decorated function is called, the decorator should check to see if the function has already been called with the given arguments. If so, the decorator should return the value the the function returned when it was last called with the given arguments. If the function last threw an exception when called with the given arguments, the same exception should be thrown again. If the function has not been called with the given arguments, then call it and record the return value or exception. Then return the return value or raise the thrown exception.

Once you have implemented the function, you should get the following behavior at the Python prompt:

>>> from decorators import *
>>> from time import sleep
>>> @memoized
>>> def foo(a):
...    sleep(a)
...    return a
>>> foo(5)
# 5 second pause
>>> foo(5))
# practically instantaneous

Part 2

In this assignment, we will use a small, very specialized version of MapReduce in Python to implement our desired algorithm, and we will run it on a single machine.

To start, you need to get familiar with the MapReduce programming model. Read the MapReduce paper, written by its inventors at Google. You only really need to read Sections 1 and 2, but if you are interested you can also read the rest.

For this assignment, we change the return type of reduce to only return a single value, which is the common usage as mentioned in the paper. So the types for map and reduce (using the notation from the paper) will look like
    map     (k1,v1)        ->  list(k2,v2)
    reduce  (k2,list(v2))  ->  v2
Download the files ff.py, sg.py, huckleberry, and gulliver. You will complete the function definitions wherever you see raise Exception("not implemented"). Some additional books to play with.

Problem #1 -- Fragment Finder (ff.py)

This problem consists of two MapReduce phases, with the goal of trying to randomly generate sentences that are close to making some sense. Once finished, you will then piece together five-word sequences to build sentences. For example, if some of the five-word sequences in the final output are "no use to try to", "to keep a journal on", "on tother side of the", "the first thing that come", then you could form the sentence "no use to try to keep a journal on tother side of the first thing that come".

Here's a more detailed overview of how the algorithm will proceed:

Intermediate results will be written in a subdirectory called output/, so first make this subdirectory in the folder where you are working:

% mkdir output

The intermediate results for each of the stages will be saved in files output/map1-c, output/reduce1-c, output/map2-c, and output/reduce2-c, where c ranges over alphabetic characters.

For each stage of both map and reduce, each worker writes to the same set of 26 output files, one for each alphabetic character. We set it up this way since we are running sequentially on a single machine, so the supplied code does not worry about creating separate output files for each worker and then combining them in between stages.

The provided skeleton file deals with breaking up inputs and farming them to multiple map and reduce workers, so you will most likely not need to look at this code closely. However, it may be useful to understand the format of passing data from one phase to another for your debugging purposes. The intermediate results stored in the output/ directory contain files with one key-value pair per line, formatted as


That is, three # characters are used to delimit the key from the value. Furthermore, both the keydata and valuedata are lists containing one or more elements, with the format


That is, each element in both the key and value is separated by a single # character. So, for example, for keys with two elements a and b and values with three elements x, y, and z, the format of the line containing this key-value pair would be


Since key-value pairs are stored in text files, all elements of a key and of a value are strings. Thus, if a map or reduce function needs to treat an element as something other than a string, it is the responsibility of the map or reduce function to perform the conversion from string to another type.

(a) Splitting input files (20 points)

Complete the function splitFile(filename, n) that takes a file filename and splits it up into smaller files, or chunks, each with at most n lines. These chunks should be saved in files named like output/filename-00, output/filename-01, and so on. The return value should be the number of chunks that the file was split into.

>>> import ff
>>> ff.clearOutputFolder()
>>> ff.splitFile("huckleberry", 1000)

Here's how you can use C-style format strings and all their glory in Python:

>>> "hello %s %.5d" % ("world", 10)
'hello world 00010'

The concatenation of these chunks should be exactly equal to the original file. Here are some sanity checks you can run from the shell:

% wc -l huckleberry
   11336 huckleberry
% wc -l output/huckleberry-*
    1000 output/huckleberry-00
    1000 output/huckleberry-01
    1000 output/huckleberry-02
    1000 output/huckleberry-03
    1000 output/huckleberry-04
    1000 output/huckleberry-05
    1000 output/huckleberry-06
    1000 output/huckleberry-07
    1000 output/huckleberry-08
    1000 output/huckleberry-09
    1000 output/huckleberry-10
     336 output/huckleberry-11
   11336 total

% cat output/huckleberry-* > tmp
% diff huckleberry tmp
% rm tmp

(b) Phase 1: Map (20 points)

Complete the function map1(inKey, inVal) that should return a list of pairs (outKey, outVal), where The skeleton code provided preprocesses the line from the text file to remove all characters except for alpha-numerics and quotes, and converts all words to lower case.

The only five-word sequences that are valid for this assignment are those in which the first letter of each word is an alphabetic character.

>>> l = ff.map1(["huckleberry","56"], ["YOU don't know about me without you have read a book by the name of The"])
>>> for x in l:
...     print x
(['you', 'dont', 'know', 'about', 'me'], ['1', 'huckleberry', '56'])
(['dont', 'know', 'about', 'me', 'without'], ['1', 'huckleberry', '56'])
(['know', 'about', 'me', 'without', 'you'], ['1', 'huckleberry', '56'])
(['about', 'me', 'without', 'you', 'have'], ['1', 'huckleberry', '56'])
(['me', 'without', 'you', 'have', 'read'], ['1', 'huckleberry', '56'])
(['without', 'you', 'have', 'read', 'a'], ['1', 'huckleberry', '56'])
(['you', 'have', 'read', 'a', 'book'], ['1', 'huckleberry', '56'])
(['have', 'read', 'a', 'book', 'by'], ['1', 'huckleberry', '56'])
(['read', 'a', 'book', 'by', 'the'], ['1', 'huckleberry', '56'])
(['a', 'book', 'by', 'the', 'name'], ['1', 'huckleberry', '56'])
(['book', 'by', 'the', 'name', 'of'], ['1', 'huckleberry', '56'])
(['by', 'the', 'name', 'of', 'the'], ['1', 'huckleberry', '56'])

>>> l = ff.map1(["blah","1"], ["one two three four five six seven 8ight 9ine 10en 11even twelve thirteen fourteen fifteen sixteen"])
>>> for x in l:
...     print x
(['one', 'two', 'three', 'four', 'five'], ['1', 'blah', '1'])
(['two', 'three', 'four', 'five', 'six'], ['1', 'blah', '1'])
(['three', 'four', 'five', 'six', 'seven'], ['1', 'blah', '1'])
(['twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen'], ['1', 'blah', '1'])

(c) Phase 1: Reduce (20 points)

Complete the function reduce1(inKey, inVals) that should return a single element outVal, where When choosing which input file and line number to return, use the following strategy: first, choose the entry that has the smallest file name (normal string comparison); second, if there are ties between two entries in the same file, choose the one with the smallest line number (normal integer comparison).

>>> ff.reduce1(['away', 'in', 'the', 'night', 'and'], [['1', 'huckleberry', '2472'], ['1', 'huckleberry', '10928']])
['2', 'huckleberry', '2472']

(d) Phase 2: Map (20 points)

Complete the function map2(inKey, inVal) that should return a list with a single pair (outKey, outVal), where

>>> ff.map2(['cabin', 'again', 'on', 'bread', 'and'], ['1', 'huckleberry', '11154'])
[(['cabin'], ['again', 'on', 'bread', 'and', '1', 'huckleberry', '11154'])]

(e) Phase 2: Reduce (20 points)

Complete the function reduce2(inKey, inVals) that should return a single element outVal, where Again, there may be multiple file-location entries for the same five-word sequence. Use the same approach as before to break ties.

>>> ff.reduce2(['underneath'], [['i', 'didnt', 'know', 'how', '1', 'huckleberry', '202'], ['the', 'picture', 'it', 'said', '2', 'huckleberry', '3916']])
['the', 'picture', 'it', 'said', '2', 'huckleberry', '3916']

Problem #2 -- Sentence Generator (sg.py)

(a) (20 points)

Complete the function genSentence(w, c), which creates a sentence starting with the word w made up from c fragments found from the two MapReduce passes. You can use the function mostCommonFragment to get the final result from the second reduce phase.

>>> import sg
>>> sg.runBothPhases(1000, ["huckleberry"])
>>> sg.mostCommonFragment("my")
['boy', 'says', 'the', 'old', '2', 'huckleberry', '3781']

If there is no entry for a given word, mostCommonFragment returns None. If you ever need a fragment for a word but there is none, you should return None from genSentence. Otherwise, you should piece together the c fragments into a single string.

Now you can run both MapReduce phases and see what sentences you can construct!

>>> sg.genSentence("my", 1)
'my boy says the old'
>>> sg.genSentence("my", 2)
'my boy says the old rags and my sugar'
>>> sg.genSentence("my", 4)
'my boy says the old rags and my sugar there was and all the time and never'
>>> sg.genSentence("no", 4)
'no use to try to put in the time i didnt want to put in the time'
>>> sg.runBothPhases(1000, ["huckleberry", "gulliver"])
>>> sg.genSentence("my", 4)
'my body by the help me tow the raft nine logs fast together with the minute descriptions'