The data we'll read comes from https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
We'll read data from the "Gift card" category, which is fairly small. The raw data is here, and should be downloaded to your local machine: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
Note that the data is gzipped (file extension .gz). Rather than unzipping it, we can use the "gzip" library to read the compressed data directly from the file.
import gzip
Using this library, we can open the data as if it were a regular file. The "rt" flag opens the file in text mode, so we read strings rather than raw bytes:
f = gzip.open(path, 'rt', encoding="utf8")
Let's look at one line of the file:
header = f.readline()
header
This line is called the "header." Note that it contains the names of the fields we expect to find in the file. In a TSV (tab-separated values) file, these fields are separated by tabs (\t).
We can extract these fields to a list using the "split()" function, which separates the string on the tab character:
header = header.strip().split('\t')
header
We can now do the same thing to extract every line from the file, using a "for" loop:
lines = []
for line in f:
    # Strip the trailing newline before splitting on tabs
    fields = line.strip().split('\t')
    lines.append(fields)
Let's look at the first line:
lines[0]
It's hard to keep track of what each field means, but note that each entry corresponds to one field from the header. Using the "zip" function, we can match the header columns to the corresponding columns of the data:
z = zip(header, lines[0])
list(z)
Note that this data is now essentially a list of "key-value" pairs, where the first entry of each pair is the key and the second is the value.
Python has a special data structure for dealing with key-value pairs known as a "dictionary". This allows us to index the data using the keys directly. Let's convert this data to a dictionary:
d = dict(zip(header, lines[0]))
d
Now we can directly query any of the fields:
d['customer_id']
d['star_rating']
It might be useful to convert a few of the numerical fields from strings to integers:
d['star_rating'] = int(d['star_rating'])
d['helpful_votes'] = int(d['helpful_votes'])
d['total_votes'] = int(d['total_votes'])
Finally, let's do the same thing for every line in the file to build our dataset:
dataset = []
for line in lines:
    # Convert to key-value pairs
    d = dict(zip(header, line))
    # Convert strings to integers for some fields:
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)
Now, we can easily perform queries on any entry in our dataset:
dataset[50]['star_rating']
Finally, while we've done these operations manually above, the same can be accomplished using Python's csv library, which saves us a few lines:
import csv
c = csv.reader(gzip.open(path, 'rt'), delimiter='\t')
dataset = []
first = True
for line in c:
    # The first line is the header
    if first:
        header = line
        first = False
    else:
        d = dict(zip(header, line))
        # Convert strings to integers for some fields:
        d['star_rating'] = int(d['star_rating'])
        d['helpful_votes'] = int(d['helpful_votes'])
        d['total_votes'] = int(d['total_votes'])
        dataset.append(d)
dataset[20]
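In fact, the csv library can do the zipping for us as well: csv.DictReader reads the header automatically and yields one dictionary per row. A minimal sketch of the same loading code using it:
import csv
import gzip
dataset = []
for d in csv.DictReader(gzip.open(path, 'rt'), delimiter='\t'):
    # Each row is already a dictionary keyed by the header fields
    d = dict(d)
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)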
Note that it can be rather costly (in terms of memory) to read the entire file into a data structure when we may only need to manipulate a small part of it at any one time. So, rather than reading the entire dataset into a data structure, this time we'll perform our pre-processing as we read each line.
Let's suppose that for this exercise, we only care about extracting user, items, ratings, timestamps, and the "verified_purchase" flag:
# Re-open the file, since we already read it to the end above:
f = gzip.open(path, 'rt', encoding="utf8")
header = f.readline().strip().split('\t')
dataset = []
for line in f:
    line = line.strip().split('\t')
    d = dict(zip(header, line))
    d['star_rating'] = int(d['star_rating'])
    # Keep only the fields we care about:
    d2 = {}
    for field in ['star_rating', 'customer_id', 'product_id', 'review_date', 'verified_purchase']:
        d2[field] = d[field]
    dataset.append(d2)
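If all we need is an aggregate statistic, we don't have to store any records at all; we can simply update running totals as we stream through the file. A minimal sketch that computes the average rating in a single pass:
f = gzip.open(path, 'rt', encoding="utf8")
header = f.readline().strip().split('\t')
ratingSum, n = 0, 0
for line in f:
    d = dict(zip(header, line.strip().split('\t')))
    ratingSum += int(d['star_rating'])
    n += 1
print(ratingSum / n)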
Let's quickly read the data again using the same code:
import gzip
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
f = gzip.open(path, 'rt', encoding="utf8")
dataset = []
# Read the header:
header = f.readline().strip().split('\t')
for line in f:
    # Separate by tabs
    line = line.strip().split('\t')
    # Convert to key-value pairs
    d = dict(zip(header, line))
    # Convert strings to integers for some fields:
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)
By iterating through our dataset, we can straightforwardly compute some simple statistics, e.g. how many ratings are there?
nRatings = len(dataset)
nRatings
And what is the average rating?
average = 0
for d in dataset:
    average += d['star_rating']
average /= nRatings
average
How many unique users and products are there in this dataset?
users = set()
items = set()
for d in dataset:
    users.add(d['customer_id'])
    items.add(d['product_id'])
len(users), len(items)
Next, let's compare the average rating of verified and unverified purchases:
avVerified = 0
avUnverified = 0
nVerified = 0
nUnverified = 0
for d in dataset:
    if d['verified_purchase'] == 'Y':
        avVerified += d['star_rating']
        nVerified += 1
    else:
        avUnverified += d['star_rating']
        nUnverified += 1
avVerified /= nVerified
avUnverified /= nUnverified
avVerified, avUnverified
Many of these operations can be done more easily using "list comprehensions", which allow us to process and filter the data:
verifiedRatings = [d['star_rating'] for d in dataset if d['verified_purchase'] == 'Y']
unverifiedRatings = [d['star_rating'] for d in dataset if d['verified_purchase'] == 'N']
print(sum(verifiedRatings) / len(verifiedRatings))
print(sum(unverifiedRatings) / len(unverifiedRatings))
Another common data format is JSON (https://www.json.org/). This format generalizes key-value pairs (like those we saw above) by allowing the values themselves to be key-value pairs, which allows for hierarchical data.
Let's look at an example of such data, this time from Yelp. This data is part of the "Yelp dataset challenge" and should first be downloaded locally before beginning this notebook: https://www.yelp.com/dataset/download
path = "/home/jmcauley/datasets/mooc/yelp_data/review.json"
This file is very large -- for the moment let's just look at the first 50,000 lines
f = open(path, 'r')
lines = []
for i in range(50000):
    lines.append(f.readline())
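Equivalently, we could collect the first 50,000 lines using itertools.islice:
from itertools import islice
lines = list(islice(open(path, 'r'), 50000))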
Let's just look at the first line:
lines[0]
Note that this looks very much like a Python dictionary! In fact, we can convert it directly to a Python dictionary using the "eval" function:
d = eval(lines[0])
d
Then we could treat it just like a key-value pair:
d['user_id']
d['stars']
The "eval" operator isn't the safest though -- it's basically executing the line of the file as if it were native python code. This is a dangerous thing to do, especially if we don't trust the source of the file we're using.
More safely, we can use the json library to read the data.
import json
and then read the data in the same way:
d = json.loads(lines[0])
d
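The json library also works in the other direction: json.dumps converts a dictionary back into a JSON-formatted string, which is handy if we want to write processed data back to a file:
s = json.dumps(d)   # dictionary -> JSON string
d2 = json.loads(s)  # JSON string -> dictionary
d2 == d             # the round trip preserves the data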
Let's look at a different dataset, also from the Yelp challenge. This time let's read the business metadata:
path = "/home/jmcauley/datasets/mooc/yelp_data/business.json"
f = open(path, 'r')
dataset = []
for i in range(50000):
    dataset.append(json.loads(f.readline()))
dataset[0]
Again, each entry is a set of key-value pairs, but note that some of the values are themselves key-value pairs:
dataset[0]['attributes']
Note also that (at least in this dataset) numerical values are already formatted as ints/floats, and don't need to be converted:
dataset[0]['stars']
Next, let's see how to handle date and time information. First, re-read the review data:
import json
path = "/home/jmcauley/datasets/mooc/yelp_data/review.json"
f = open(path, 'r')
dataset = []
for i in range(50000):
    dataset.append(json.loads(f.readline()))
dataset[0]
Let's look at the first review's date:
timeString = dataset[0]['date']
print(timeString)
To handle the string-formatted time data, we can use python's "time" library:
import time
timeStruct = time.strptime(timeString, "%Y-%m-%d")
timeStruct
The above operation produced a data structure representing the time string. Note that we had to specify a format: here a four-digit year, and a two-digit month and day, separated by hyphens. Details about format strings can be found at https://docs.python.org/2/library/datetime.html. E.g. we could also have a string consisting of a time and a date:
time.strptime("21:36:18, 28/5/2019", "%H:%M:%S, %d/%m/%Y")
The above time structure allows us to extract specific information about the date/timestamp, but it may be easier yet to convert the timestamps into integers for easy comparison:
timeInt = time.mktime(timeStruct)
timeInt
The exact value of this time is the number of seconds since January 1, 1970 (the Unix epoch). While not the most obvious format, it does allow for easy comparison between times:
timeInt2 = time.mktime(time.strptime(dataset[99]['date'], "%Y-%m-%d"))
E.g. the number of seconds between dataset[0] and dataset[99] is...
timeDiff = timeInt - timeInt2
timeDiff
timeDiff / 60 # minutes
timeDiff / (60*60) # hours
timeDiff / (60*60*24) # days
The values can also be converted back to a structured time object:
time.gmtime(timeInt)
Or we can determine, for example, what the date would be one week later:
time.gmtime(timeInt + 60*60*24*7)
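We can also go the other way: time.strftime formats a structured time back into a readable string using the same format codes, e.g. to print the date one week later:
time.strftime("%Y-%m-%d", time.gmtime(timeInt + 60*60*24*7))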
Finally, it might be useful to augment our dataset to include these numerical time measurements for every sample:
datasetWithTimeValues = []
for d in dataset:
    d['timeStruct'] = time.strptime(d['date'], "%Y-%m-%d")
    d['timeInt'] = time.mktime(d['timeStruct'])
    datasetWithTimeValues.append(d)
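One immediate use of these integer timestamps is sorting; e.g., we can arrange the reviews chronologically:
datasetWithTimeValues.sort(key=lambda d: d['timeInt'])
datasetWithTimeValues[0]['date']  # the earliest review in our sample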
The same strategy can also be used to deal with times that have no date attached to them, e.g. let's look at the length of a business's opening hours:
path = "/home/jmcauley/datasets/mooc/yelp_data/business.json"
f = open(path, 'r')
dataset = []
for i in range(50000):
    dataset.append(json.loads(f.readline()))
dataset[0]['hours']
Let's try to calculate how long the business is open on Fridays:
hours = dataset[0]['hours']['Friday']
hours
openTime, closeTime = hours.split('-')
openTime, closeTime
timeIntOpen = time.mktime(time.strptime(openTime, "%H:%M"))
timeIntClose = time.mktime(time.strptime(closeTime, "%H:%M"))
timeIntOpen
timeIntClose
Note that since we specified no date, these timestamps are assumed to correspond to January 1, 1900:
time.gmtime(timeIntOpen)
However, for the purposes of measuring the opening duration, it's sufficient to compute the difference:
(timeIntClose - timeIntOpen) / (60*60)
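If we wanted to apply this to every business we'd need to be careful: some entries have no "hours" field, and a business that closes after midnight would yield a negative difference. Below is a minimal sketch that guards against both cases (the helper name fridayHours is our own, not part of the dataset):
def fridayHours(business):
    # Return the Friday opening duration in hours, or None if unavailable
    if not business.get('hours') or 'Friday' not in business['hours']:
        return None
    openTime, closeTime = business['hours']['Friday'].split('-')
    tOpen = time.mktime(time.strptime(openTime, "%H:%M"))
    tClose = time.mktime(time.strptime(closeTime, "%H:%M"))
    diff = (tClose - tOpen) / (60*60)
    # Handle closing times past midnight, which would otherwise come out negative
    return diff if diff >= 0 else diff + 24

fridayHours(dataset[0])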
Case Study: Amazon Dataset
We'll again use the "Gift card" dataset. Please unzip the file after downloading it and place amazon_reviews_us_Gift_Card_v1_00.tsv under the datasets directory.
Objective: We seek to draw inferences from this dataset while exploring some of the functionality pandas has to offer.
import pandas as pd
giftcard = pd.read_csv('./datasets/amazon_reviews_us_Gift_Card_v1_00.tsv', sep='\t')
print(type(giftcard))
giftcard.head()
giftcard.describe()  # Summary statistics for the numerical columns
giftcard['star_rating'].max()
giftcard['star_rating'].min()
giftcard['star_rating'].mean()
giftcard.isnull().any()  # Which columns contain missing values?
giftcard.shape
giftcard = giftcard.dropna()  # Drop rows containing missing values
giftcard.shape  # Confirm that the offending rows were removed
# Restrict to the numerical columns of interest:
df = giftcard[['star_rating','helpful_votes','total_votes']]
df
df.head(100).plot.bar()
df.head(100).plot.hist()
df.head(100).plot()
# Keep only reviews that received at least one helpful vote:
df_helpful_votes = df[df['helpful_votes'] > 0]
print(df.shape)
print(df_helpful_votes.shape)
df_helpful_votes
# Average of each column, grouped by the total number of votes:
df_helpful_votes.groupby('total_votes').mean()
df_helpful_votes[['helpful_votes','total_votes']].head(50).plot.bar()
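Note that pandas' groupby also gives us a one-line version of the verified/unverified comparison we computed manually earlier:
giftcard.groupby('verified_purchase')['star_rating'].mean()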
%matplotlib inline
giftcard.hist(column='star_rating', figsize=(15,10))
Returning to the Yelp review data, let's do a simple plot of ratings as a function of the day of the week.
First we'll import the matplotlib library, along with a defaultdict to collect the ratings for each weekday:
from matplotlib import pyplot as plt
from collections import defaultdict
weekRatings = defaultdict(list)
for d in datasetWithTimeValues:
    day = d['timeStruct'].tm_wday
    weekRatings[day].append(d['stars'])
weekAverages = {}
for d in weekRatings:
    weekAverages[d] = sum(weekRatings[d]) / len(weekRatings[d])
weekAverages
X = list(weekAverages.keys())
Y = [weekAverages[x] for x in X]
plt.plot(X, Y)
It might be nicer to plot this as a bar plot:
plt.bar(X, Y)
Let's zoom in to make the differences a bit more visible:
plt.ylim(3.6, 3.8)
plt.bar(X, Y)
Next let's add some labels:
plt.ylim(3.6, 3.8)
plt.xlabel("Weekday")
plt.ylabel("Rating")
plt.title("Rating as a function of weekday")
plt.bar(X, Y)
Finally, let's rename the ticks to correspond to the days of the week. Note that tm_wday runs from 0 (Monday) to 6 (Sunday):
plt.ylim(3.6, 3.8)
plt.xlabel("Weekday")
plt.ylabel("Rating")
plt.xticks([0,1,2,3,4,5,6],['M', 'T', 'W', 'T', 'F', 'S', 'S'])
plt.title("Rating as a function of weekday")
plt.bar(X, Y)