This section presents an example MapReduce task to calculate the average movie
rating per user in the dataset.
To compute the average rating for a user, two values must be captured per
rating: the rating value and a rating sum that the reducer can later total.
This is not as straightforward as it may seem. Because the map tasks work in
parallel on separate chunks of the ratings file and have no higher-level
knowledge of which ratings and users have already been emitted, the simplest
possible emit must be chosen that still allows the reducer to aggregate the
sum of the ratings. To accomplish this, every rating produces one emit keyed
on the UserID, whose value is a two-element list containing the rating value
and the rating sum. Because each map task works on each rating independently,
the rating sum is always emitted as a 1.
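The per-record map logic can be sketched in plain Java, outside of Hadoop's
Mapper API. This is only a sketch: the "UserID,Rating" column layout of the
CSV line is an assumption here, so the indices would need to match the real
file.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class RatingMap {
    // Emit for one CSV record: key = UserID, value = [rating value, rating sum].
    // Assumes a "UserID,Rating" line layout; adjust the indices for the real file.
    public static Map.Entry<String, int[]> map(String csvLine) {
        String[] fields = csvLine.split(",");
        int rating = Integer.parseInt(fields[1].trim());
        // The rating sum is always emitted as 1: the mapper sees each rating
        // in isolation and cannot know how many other ratings the user has.
        return new SimpleEntry<>(fields[0].trim(), new int[] { rating, 1 });
    }
}
```

Under that assumed layout, a line such as "1623180,5" maps to the first emit
shown below: key "1623180" with the value pair <5, 1>.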
High level emit:
CSV Data file line -> <UserID, <rating value, rating sum>>
<1623180, <5, 1>>
<282486, <3, 1>>
<1987434, <4, 1>>
The reduce task then iterates over the <rating value, rating sum> list for
each user and aggregates the rating values and the sums. The final operation
of the reduce task is to calculate the average rating from the aggregated sum
and rating values and emit the average rating for the user.
High level emit:
<UserID, <rating value/sum list>> -> <user, average rating>
<1623180, <<5, 1>, <4, 1>, <1, 1>, <4, 1>, ... >>
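The aggregation for one user can likewise be sketched in plain Java — this is
a sketch of the reduce logic only, not Hadoop's Reducer API:

```java
import java.util.List;

public class RatingReduce {
    // Reduce for one user: total the rating values and the rating sums from
    // the map emits, then compute the user's average rating.
    public static double reduce(List<int[]> ratingPairs) {
        int ratingTotal = 0;
        int ratingSum = 0;
        for (int[] pair : ratingPairs) {
            ratingTotal += pair[0]; // rating value
            ratingSum += pair[1];   // always 1 per emit
        }
        return (double) ratingTotal / ratingSum;
    }
}
```

For the four emits <5, 1>, <4, 1>, <1, 1>, <4, 1> this yields
(5 + 4 + 1 + 4) / 4 = 3.5 as the user's average rating.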
First, we compile our job and package it into a jar.
$ javac -cp "/usr/lib/hadoop/client/*" *.java
$ jar cvf example.jar *.class
Then we copy the input files from our local storage to HDFS.
$ hadoop fs -copyFromLocal input input
We are then ready to run the job.
$ hadoop jar example.jar HadoopDriver input output
Once the job is done, we can copy out the output files.
$ hadoop fs -copyToLocal output output
Adapted from California Polytechnic State University's 2008 "Laboratory
Assignment: MapReduce Paradigm in a VirtualGrid Environment"
Licensed under Creative Commons Attribution 3.0 License -