CSE 130 - Programming Assignment #7


120 points

Must be turned in no later than 4:59:59 PM on 3/5/2010
(see submission instructions below)

(click your browser's refresh button to ensure that you have the most recent version)

Note: To download and install Python version 2.6 on your home machines see this page. Remember that this is only to enable you to play with the assignment at home: the final version turned in must work on the ACS Linux machines. While you can use MacOS or Windows to begin working with Python, the code you turn in must run correctly in the Linux environment.

Integrity of Scholarship

University rules on integrity of scholarship will be strictly enforced. By completing this assignment, you implicitly agree to abide by the UCSD Policy on Integrity of Scholarship described beginning on page 68 of the Academic Regulations section (PDF) of the 2007-2008 General Catalog, in particular, "all academic work will be done by the student to whom it is assigned, without unauthorized aid of any kind."

You are expected to do your own work on this assignment; there are no group projects in this course. You may (and are encouraged to) engage in general discussions with your classmates regarding the assignment, but specific details of a solution, including the solution itself, must always be your own work. Incidents that violate the University's rules on integrity of scholarship will be taken seriously: in addition to receiving a zero (0) on the assignment, students may also face other penalties, up to and including expulsion from the University. Should you have any doubt about the moral and/or ethical implications of an activity associated with the completion of this assignment, please see the instructors.

Code Documentation and General Requirements

Code for all programming assignments should be well documented. A working program with no comments will receive only partial credit. Documentation entails providing documentation strings for all methods, classes, packages, etc., and comments throughout the code to explain the program logic. Comments in Python are preceded by # and extend to the end of the line. Documentation strings are strings in the first line of a function, method, etc., and are accessible using help(foo), where foo is the name of the method, class, etc. It is understood that some of the exercises in this programming assignment require extremely little code and will not require extensive comments.
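
As a small illustration of the conventions above, the following sketch shows a documentation string and a comment on a hypothetical function countWords, which is not part of the assignment files.

```python
def countWords(line):
    """Return the number of whitespace-separated words in line."""
    # split() with no arguments collapses runs of whitespace, so
    # leading and trailing blanks do not produce empty words.
    return len(line.split())


# help(countWords) displays the documentation string; the same text
# is also available through the __doc__ attribute.
summary = countWords.__doc__
```

Here the documentation string says what the function returns, while the comment explains why the code is written the way it is.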

While few programming assignments pretend to mimic the "real" world, they may, nevertheless, contain some of the ambiguity that exists outside the classroom. If, for example, an assignment is amenable to differing interpretations, such that more than one algorithm may implement a correct solution to the assignment, it is incumbent upon the programmer to document not only the functionality of the algorithm (and more broadly his/her interpretation of the program requirements), but to articulate clearly the reasoning behind a particular choice of solution.

Submission Instructions

1. Create the zip file for submission

Your solutions to this assignment will be stored in a directory called pa7_solution/, inside which you will place the files ff.py and sg.py. These two files are your modified versions of the corresponding supplied files. There should be no other files in the directory.

After creating and populating the directory as described above, create a zip file called <LastName>_<FirstName>_pa7.zip by going into the directory pa7_solution and executing the UNIX shell command:

zip <LastName>_<FirstName>_pa7.zip ff.py sg.py

2. Submit the zip file via the turnin program

Once you've created the zip file with your solutions, you will use the turnin program to submit this file for grading by going into the directory pa7_solution/ and executing the UNIX shell command:

turnin -c cs130w -p pa7 <LastName>_<FirstName>_pa7.zip

The turnin program will provide you with a confirmation of the submission process; make sure that the size of the file indicated by turnin matches the size of your zip file. See the ACS Web page on turnin for more information on the operation of the program.

Assignment Overview

The goal of this problem is to familiarize you with the programming model behind Google's MapReduce framework. As we discussed in class, MapReduce is a programming paradigm inspired by functional programming that Google has developed in order to express their own internal computations over large data sets. Google's implementation of MapReduce is not available publicly, but Hadoop is a publicly available, open-source version of MapReduce written in Java.

In this assignment, we will use a small, very specialized version of MapReduce in Python to implement our desired algorithm, and we will run it on a single machine.

To start, you need to get familiar with the MapReduce programming model. Read the MapReduce paper, written by its inventors at Google. You only really need to read Sections 1 and 2, but if you are interested you can also read the rest.

For this assignment, we change the return type of reduce to only return a single value, which is the common usage as mentioned in the paper. So the types for map and reduce (using the notation from the paper) will look like
    map     (k1,v1)        ->  list(k2,v2)
    reduce  (k2,list(v2))  ->  v2
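
To make these types concrete, here is a minimal word-count sketch in the same shape. The names wcMap, wcReduce, and runMapReduce are hypothetical and are not part of the supplied skeleton; the driver simply groups the map output by key in memory on one machine.

```python
def wcMap(k1, v1):
    """map: (k1, v1) -> list of (k2, v2). Emit (word, 1) for each word."""
    return [(word, 1) for word in v1.split()]


def wcReduce(k2, vals):
    """reduce: (k2, list(v2)) -> v2. Sum the counts for one word."""
    return sum(vals)


def runMapReduce(mapFn, reduceFn, inputs):
    """Apply mapFn to every input pair, group by key, then reduce."""
    groups = {}
    for k1, v1 in inputs:
        for k2, v2 in mapFn(k1, v1):
            groups.setdefault(k2, []).append(v2)
    return dict((k2, reduceFn(k2, vals)) for k2, vals in groups.items())


counts = runMapReduce(wcMap, wcReduce, [("line1", "a b a"), ("line2", "b c")])
# counts == {'a': 2, 'b': 2, 'c': 1}
```

Note that wcReduce returns a single value rather than a list, matching the simplified reduce type used in this assignment.
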
Download the files ff.py, sg.py, huckleberry, and gulliver. You will complete the function definitions wherever you see raise Exception("not implemented"). Some additional books to play with are also available.

Problem #1 -- Fragment Finder (ff.py)

This problem consists of two MapReduce phases, with the goal of trying to randomly generate sentences that are close to making some sense. Once finished, you will then piece together five-word sequences to build sentences. For example, if some of the five-word sequences in the final output are "no use to try to", "to keep a journal on", "on tother side of the", "the first thing that come", then you could form the sentence "no use to try to keep a journal on tother side of the first thing that come".

Here's a more detailed overview of how the algorithm will proceed:

Intermediate results will be written in a subdirectory called output/, so first make this subdirectory in the folder where you are working:

% mkdir output

The intermediate results for each of the stages will be saved in files output/map1-c, output/reduce1-c, output/map2-c, and output/reduce2-c, where c ranges over alphabetic characters.

For each stage of both map and reduce, each worker writes to the same set of 26 output files, one for each alphabetic character. We set it up this way since we are running sequentially on a single machine, so the supplied code does not worry about creating separate output files for each worker and then combining them in between stages.

The provided skeleton file deals with breaking up inputs and farming them out to multiple map and reduce workers, so you will most likely not need to look at this code closely. However, it may be useful for debugging to understand the format in which data is passed from one phase to another. The files in the output/ directory hold the intermediate results, one key-value pair per line, formatted as

    keydata###valuedata

That is, three # characters are used to delimit the key from the value. Furthermore, both the keydata and valuedata are lists containing one or more elements, with the format

    element1#element2#...#elementN

That is, each element in both the key and value is separated by a single # character. So, for example, for keys with two elements a and b and values with three elements x, y, and z, the format of the line containing this key-value pair would be

    a#b###x#y#z
Since key-value pairs are stored in text files, all elements of a key and of a value are strings. Thus, if a map or reduce function needs to treat an element as something other than a string, it is the responsibility of the map or reduce function to perform the conversion from string to another type.
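
When inspecting the intermediate files while debugging, a pair of small helpers along the following lines may be handy. This is only a sketch of the line format described above: formatPair and parsePair are hypothetical names, not functions from the skeleton, and the sketch assumes that no element is empty or contains a # character.

```python
def formatPair(key, val):
    """Join key and value element lists into one intermediate-file line."""
    return "#".join(key) + "###" + "#".join(val)


def parsePair(line):
    """Split one line back into (key, val), both lists of strings."""
    keydata, valuedata = line.rstrip("\n").split("###", 1)
    return keydata.split("#"), valuedata.split("#")


line = formatPair(["a", "b"], ["x", "y", "z"])
# line == 'a#b###x#y#z'

key, val = parsePair("you#dont#know#about#me###1#huckleberry#56\n")
# Every element comes back as a string, so numeric elements must be
# converted explicitly before doing arithmetic or integer comparison.
count = int(val[0])
lineNumber = int(val[2])
```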

(a) Splitting input files (20 points)

Complete the function splitFile(filename, n) that takes a file filename and splits it up into smaller files, or chunks, each with at most n lines. These chunks should be saved in files named like output/filename-00, output/filename-01, and so on. The return value should be the number of chunks that the file was split into.

>>> import ff
>>> ff.clearOutputFolder()
>>> ff.splitFile("huckleberry", 1000)

Here's how you can use C-style format strings in all their glory in Python:

>>> "hello %s %.5d" % ("world", 10)
'hello world 00010'
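
The same mechanism can produce a two-digit, zero-padded suffix like the one in output/filename-00. The helper chunkName below is hypothetical and only illustrates the format string; it is not part of ff.py.

```python
def chunkName(filename, i):
    """Build a chunk path such as output/huckleberry-00."""
    # %.2d pads the integer with leading zeros to at least two digits.
    return "output/%s-%.2d" % (filename, i)


names = [chunkName("huckleberry", i) for i in (0, 1, 11)]
# names == ['output/huckleberry-00', 'output/huckleberry-01',
#           'output/huckleberry-11']
```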

The concatenation of these chunks should be exactly equal to the original file. Here are some sanity checks you can run from the shell:

% wc -l huckleberry
   11336 huckleberry
% wc -l output/huckleberry-*
    1000 output/huckleberry-00
    1000 output/huckleberry-01
    1000 output/huckleberry-02
    1000 output/huckleberry-03
    1000 output/huckleberry-04
    1000 output/huckleberry-05
    1000 output/huckleberry-06
    1000 output/huckleberry-07
    1000 output/huckleberry-08
    1000 output/huckleberry-09
    1000 output/huckleberry-10
     336 output/huckleberry-11
   11336 total

% cat output/huckleberry-* > tmp
% diff huckleberry tmp
% rm tmp
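
The cat and diff check above can also be run from Python. The following is a sketch only: chunksMatchOriginal is a hypothetical helper, and it assumes the chunks live in output/ and that there are fewer than 100 of them, so that sorting the names alphabetically also sorts them numerically.

```python
import glob


def chunksMatchOriginal(filename, outdir="output"):
    """Return True if the chunks, joined in name order, equal the file."""
    with open(filename, "rb") as f:
        original = f.read()
    pieces = []
    # sorted() mirrors the order in which the shell expands the wildcard.
    for chunk in sorted(glob.glob("%s/%s-*" % (outdir, filename))):
        with open(chunk, "rb") as f:
            pieces.append(f.read())
    return b"".join(pieces) == original
```

After running ff.splitFile("huckleberry", 1000), calling chunksMatchOriginal("huckleberry") should return True if the split preserved every line.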

(b) Phase 1: Map (20 points)

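Complete the function map1(inKey, inVal) that should return a list of pairs (outKey, outVal), where, as the examples below show, outKey is a sequence of five consecutive words from the line and outVal consists of the count 1 followed by the file name and line number taken from inKey. The skeleton code provided preprocesses the line from the text file to remove all characters except for alpha-numerics and quotes, and converts all words to lower case.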

The only five-word sequences that are valid for this assignment are those in which the first letter of each word is an alphabetic character.

>>> l = ff.map1(["huckleberry","56"], ["YOU don't know about me without you have read a book by the name of The"])
>>> for x in l:
...     print x
(['you', 'dont', 'know', 'about', 'me'], ['1', 'huckleberry', '56'])
(['dont', 'know', 'about', 'me', 'without'], ['1', 'huckleberry', '56'])
(['know', 'about', 'me', 'without', 'you'], ['1', 'huckleberry', '56'])
(['about', 'me', 'without', 'you', 'have'], ['1', 'huckleberry', '56'])
(['me', 'without', 'you', 'have', 'read'], ['1', 'huckleberry', '56'])
(['without', 'you', 'have', 'read', 'a'], ['1', 'huckleberry', '56'])
(['you', 'have', 'read', 'a', 'book'], ['1', 'huckleberry', '56'])
(['have', 'read', 'a', 'book', 'by'], ['1', 'huckleberry', '56'])
(['read', 'a', 'book', 'by', 'the'], ['1', 'huckleberry', '56'])
(['a', 'book', 'by', 'the', 'name'], ['1', 'huckleberry', '56'])
(['book', 'by', 'the', 'name', 'of'], ['1', 'huckleberry', '56'])
(['by', 'the', 'name', 'of', 'the'], ['1', 'huckleberry', '56'])

>>> l = ff.map1(["blah","1"], ["one two three four five six seven 8ight 9ine 10en 11even twelve thirteen fourteen fifteen sixteen"])
>>> for x in l:
...     print x
(['one', 'two', 'three', 'four', 'five'], ['1', 'blah', '1'])
(['two', 'three', 'four', 'five', 'six'], ['1', 'blah', '1'])
(['three', 'four', 'five', 'six', 'seven'], ['1', 'blah', '1'])
(['twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen'], ['1', 'blah', '1'])
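
The validity rule can be pictured as a window of five words sliding across the line, where a window is kept only if every word in it starts with a letter. The sketch below illustrates just that rule on a list of already preprocessed words; validWindows is a hypothetical helper, not the map1 function itself, and it does not build the output values.

```python
def validWindows(words, size=5):
    """Return each run of `size` consecutive words in which every word
    begins with an alphabetic character."""
    windows = []
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        # w[:1] is the first character, or '' for an empty word.
        if all(w[:1].isalpha() for w in window):
            windows.append(window)
    return windows


text = ("one two three four five six seven 8ight 9ine 10en 11even "
        "twelve thirteen fourteen fifteen sixteen")
found = validWindows(text.split())
# found contains four windows: the three starting at 'one', 'two' and
# 'three', plus 'twelve thirteen fourteen fifteen sixteen'.
```

Every window that overlaps 8ight, 9ine, 10en, or 11even is dropped, which is why the second example above yields only four pairs.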

(c) Phase 1: Reduce (20 points)

Complete the function reduce1(inKey, inVals) that should return a single element outVal, where, as the example below shows, outVal consists of the total number of occurrences of the five-word sequence followed by the file name and line number of one of those occurrences. When choosing which input file and line number to return, use the following strategy: first, choose the entry that has the smallest file name (normal string comparison); second, if there are ties between two entries in the same file, choose the one with the smallest line number (normal integer comparison).

>>> ff.reduce1(['away', 'in', 'the', 'night', 'and'], [['1', 'huckleberry', '2472'], ['1', 'huckleberry', '10928']])
['2', 'huckleberry', '2472']
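
The tie-breaking strategy relies on comparing line numbers as integers, not as strings. A minimal sketch of that ordering follows, using a hypothetical helper earliestLocation that works on (file name, line number) pairs of strings; it covers only the choice of location, not the rest of reduce1.

```python
def earliestLocation(entries):
    """Pick the (fileName, lineNumber) pair with the smallest file name,
    breaking ties by the smallest line number compared as an integer."""
    return min(entries, key=lambda e: (e[0], int(e[1])))


sameFile = [("huckleberry", "2472"), ("huckleberry", "10928")]
best = earliestLocation(sameFile)
# best == ('huckleberry', '2472')

# Plain string comparison would get this wrong, because '10928' sorts
# before '2472' character by character.
wrong = min(sameFile)
# wrong == ('huckleberry', '10928')
```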

(d) Phase 2: Map (20 points)

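Complete the function map2(inKey, inVal) that should return a list with a single pair (outKey, outVal), where, as the example below shows, outKey contains only the first word of inKey and outVal contains the remaining four words followed by the elements of inVal.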

>>> ff.map2(['cabin', 'again', 'on', 'bread', 'and'], ['1', 'huckleberry', '11154'])
[(['cabin'], ['again', 'on', 'bread', 'and', '1', 'huckleberry', '11154'])]

(e) Phase 2: Reduce (20 points)

Complete the function reduce2(inKey, inVals) that should return a single element outVal, where, as the example below shows, outVal is the entry in inVals with the largest count. Again, there may be multiple candidate entries to choose from. Use the same approach as before to break ties.

>>> ff.reduce2(['underneath'], [['i', 'didnt', 'know', 'how', '1', 'huckleberry', '202'], ['the', 'picture', 'it', 'said', '2', 'huckleberry', '3916']])
['the', 'picture', 'it', 'said', '2', 'huckleberry', '3916']
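
In the example above, the entry with the larger count is the one returned. Assuming that the count decides first and that the file name and line number break ties as in Phase 1, the ordering could be sketched with a single sort key. pickFragment is a hypothetical name, and the sketch assumes the last three elements of each entry are the count, file name, and line number.

```python
def pickFragment(entries):
    """Return the entry with the largest count; on equal counts prefer
    the smallest file name, then the smallest line number."""
    # Negating the count lets min() select the largest count first.
    return min(entries, key=lambda e: (-int(e[-3]), e[-2], int(e[-1])))


chosen = pickFragment([
    ['i', 'didnt', 'know', 'how', '1', 'huckleberry', '202'],
    ['the', 'picture', 'it', 'said', '2', 'huckleberry', '3916'],
])
# chosen == ['the', 'picture', 'it', 'said', '2', 'huckleberry', '3916']
```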

Problem #2 -- Sentence Generator (sg.py)

(a) (20 points)

Complete the function genSentence(w, c), which creates a sentence starting with the word w and made up of c fragments found by the two MapReduce passes. You can use the function mostCommonFragment to get the final result from the second reduce phase.

>>> import sg
>>> sg.runBothPhases(1000, ["huckleberry"])
>>> sg.mostCommonFragment("my")
['boy', 'says', 'the', 'old', '2', 'huckleberry', '3781']

If there is no entry for a given word, mostCommonFragment returns None. If you ever need a fragment for a word but there is none, you should return None from genSentence. Otherwise, you should piece together the c fragments into a single string.

Now you can run both MapReduce phases and see what sentences you can construct!

>>> sg.genSentence("my", 1)
'my boy says the old'
>>> sg.genSentence("my", 2)
'my boy says the old rags and my sugar'
>>> sg.genSentence("my", 4)
'my boy says the old rags and my sugar there was and all the time and never'
>>> sg.genSentence("no", 4)
'no use to try to put in the time i didnt want to put in the time'
>>> sg.runBothPhases(1000, ["huckleberry", "gulliver"])
>>> sg.genSentence("my", 4)
'my body by the help me tow the raft nine logs fast together with the minute descriptions'
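
The outputs above suggest that each fragment is looked up from the last word of the sentence built so far. The sketch below shows that chaining idea with a toy in-memory table standing in for the real reduce output. chainFragments and toyTable are hypothetical, and the count and location stored for "old" are placeholders, not real results.

```python
def chainFragments(start, c, lookup):
    """Build a sentence from `start` by appending c four-word fragments.

    lookup(word) must return a list whose first four elements are the
    words that follow `word`, or None if there is no fragment for it.
    """
    words = [start]
    for _ in range(c):
        fragment = lookup(words[-1])
        if fragment is None:
            return None
        words.extend(fragment[:4])
    return " ".join(words)


# Toy table; the trailing three elements for "old" are placeholders.
toyTable = {
    "my": ["boy", "says", "the", "old", "2", "huckleberry", "3781"],
    "old": ["rags", "and", "my", "sugar", "0", "placeholder", "0"],
}

one = chainFragments("my", 1, toyTable.get)
two = chainFragments("my", 2, toyTable.get)
three = chainFragments("my", 3, toyTable.get)
# one == 'my boy says the old'
# two == 'my boy says the old rags and my sugar'
# three is None because the toy table has no entry for 'sugar'
```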