Reading CSV/TSV

We'll read data from the "Gift card" category, which is fairly small. The raw data is here, and should be downloaded to your local machine: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz

In [1]:
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"

Note that the data is gzipped (file extension .gz). Rather than unzipping it, we can use the "gzip" library to read the compressed data directly from the file:

In [2]:
import gzip

Using this library, we can open the data as if it were a regular file; mode "rt" reads text, converting bytes to strings:

In [3]:
f = gzip.open(path, 'rt', encoding="utf8")

Let's look at one line of the file:

In [4]:
header = f.readline()
In [5]:
header
Out[5]:
'marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date\n'

This line is called the "header." Note that it contains the names of the fields we expect to find in the file. These fields are separated by tabs (\t) in a TSV file.

We can extract these fields to a list using the "split()" function, which separates the string on the tab character:

In [6]:
header = header.strip().split('\t')
In [7]:
header
Out[7]:
['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

We can now do the same thing to extract every line from the file, using a "for" loop:

In [8]:
lines = []
In [9]:
for line in f:
    fields = line.split('\t')
    lines.append(fields)

Let's look at the first line:

In [10]:
lines[0]
Out[10]:
['US',
 '24371595',
 'R27ZP1F1CD0C3Y',
 'B004LLIL5A',
 '346014806',
 'Amazon eGift Card - Celebrate',
 'Gift Card',
 '5',
 '0',
 '0',
 'N',
 'Y',
 'Five Stars',
 'Great birthday gift for a young adult.',
 '2015-08-31\n']

It's hard to keep track of what each field means, but note that each entry corresponds to one field from the header. Using the "zip" function, we can match the header columns to the corresponding columns of the data:

In [11]:
z = zip(header, lines[0])
list(z)
Out[11]:
[('marketplace', 'US'),
 ('customer_id', '24371595'),
 ('review_id', 'R27ZP1F1CD0C3Y'),
 ('product_id', 'B004LLIL5A'),
 ('product_parent', '346014806'),
 ('product_title', 'Amazon eGift Card - Celebrate'),
 ('product_category', 'Gift Card'),
 ('star_rating', '5'),
 ('helpful_votes', '0'),
 ('total_votes', '0'),
 ('vine', 'N'),
 ('verified_purchase', 'Y'),
 ('review_headline', 'Five Stars'),
 ('review_body', 'Great birthday gift for a young adult.'),
 ('review_date', '2015-08-31\n')]

Note that this data is now essentially what is known as a "key-value" pair, where the first entry is the key and the second is the value.

Python has a special data structure for dealing with key value pairs known as a "dictionary". This allows us to index the data using the keys directly. Let's convert this data to a dictionary:

In [12]:
d = dict(zip(header, lines[0]))
d
Out[12]:
{'customer_id': '24371595',
 'helpful_votes': '0',
 'marketplace': 'US',
 'product_category': 'Gift Card',
 'product_id': 'B004LLIL5A',
 'product_parent': '346014806',
 'product_title': 'Amazon eGift Card - Celebrate',
 'review_body': 'Great birthday gift for a young adult.',
 'review_date': '2015-08-31\n',
 'review_headline': 'Five Stars',
 'review_id': 'R27ZP1F1CD0C3Y',
 'star_rating': '5',
 'total_votes': '0',
 'verified_purchase': 'Y',
 'vine': 'N'}

Now we can directly query any of the fields:

In [13]:
d['customer_id']
Out[13]:
'24371595'
In [14]:
d['star_rating']
Out[14]:
'5'

It might be useful to convert a few of the numerical fields from strings to integers:

In [15]:
d['star_rating'] = int(d['star_rating'])
d['helpful_votes'] = int(d['helpful_votes'])
d['total_votes'] = int(d['total_votes'])

Finally, let's do the same thing for every line in the file, to build our dataset:

In [16]:
dataset = []
In [17]:
for line in lines:
    # Convert to key-value pairs
    d = dict(zip(header, line))
    # Convert strings to integers for some fields:
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)

Now, we can easily perform queries on any entry in our dataset:

In [18]:
dataset[50]['star_rating']
Out[18]:
5

Finally, while we've done these operations manually above, the same can be accomplished using the Python csv library, which saves us a few lines:

In [19]:
import csv
In [20]:
c = csv.reader(gzip.open(path, 'rt'), delimiter='\t')
dataset = []
In [21]:
first = True
for line in c:
    # The first line is the header
    if first:
        header = line
        first = False
    else:
        d = dict(zip(header, line))
        # Convert strings to integers for some fields:
        d['star_rating'] = int(d['star_rating'])
        d['helpful_votes'] = int(d['helpful_votes'])
        d['total_votes'] = int(d['total_votes'])
        dataset.append(d)
In [22]:
dataset[20]
Out[22]:
{'customer_id': '5539254',
 'helpful_votes': 1,
 'marketplace': 'US',
 'product_category': 'Gift Card',
 'product_id': 'B00EPLT448',
 'product_parent': '298151217',
 'product_title': 'Amazon Gift Card - Print - 16th Birthday (Sweet 16)',
 'review_body': 'my nice was quite suprised that it was a card and gift card all in one.',
 'review_date': '2015-08-31',
 'review_headline': 'Four Stars',
 'review_id': 'RDSFSZDZ0P6QH',
 'star_rating': 4,
 'total_votes': 1,
 'verified_purchase': 'N',
 'vine': 'N'}
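In fact, the csv library can also do the header matching for us: csv.DictReader reads each row directly into a dictionary keyed by the header. Here is a sketch on a small in-memory sample; with the real file we'd pass gzip.open(path, 'rt') in place of the StringIO:

```python
import csv
import io

# A tiny tab-separated sample, standing in for the gzipped review file
tsv = ("customer_id\tstar_rating\thelpful_votes\ttotal_votes\n"
       "24371595\t5\t0\t0\n"
       "42489718\t4\t1\t2\n")

dataset = []
for d in csv.DictReader(io.StringIO(tsv), delimiter='\t'):  # first row becomes the keys
    # DictReader returns one dict per row; numeric fields still need converting by hand
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)

dataset[0]['star_rating']  # 5
```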

Avoiding reading large files into memory

Note that it can be rather costly (in terms of memory) to read an entire file into a data structure, when we may only need to manipulate a small part of it at any one time. So, rather than reading the entire dataset into a data structure, this time we'll perform our pre-processing as we read each line.

Let's suppose that for this exercise, we only care about extracting users, items, ratings, timestamps, and the "verified_purchase" flag:

In [23]:
dataset = []
f = gzip.open(path, 'rt')  # reopen the file, since the loops above consumed it
header = f.readline().strip().split('\t')
for line in f:
    line = line.split('\t')
    d = dict(zip(header, line))
    d['star_rating'] = int(d['star_rating'])
    d2 = {}
    for field in ['star_rating', 'customer_id', 'product_id', 'review_date', 'verified_purchase']:
        d2[field] = d[field]
    dataset.append(d2)
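Taken further, we need not store even the reduced rows if all we want is a summary statistic: we can accumulate it as we stream. A minimal sketch, using a small in-memory sample to stand in for the open gzip file handle:

```python
import io

# Stand-in for the open file handle f; each line is one tab-separated review
f = io.StringIO("customer_id\tstar_rating\n"
                "24371595\t5\n"
                "861463\t5\n"
                "25283295\t1\n")

header = f.readline().strip().split('\t')
ratingIndex = header.index('star_rating')

# Keep a running sum and count rather than storing the rows themselves
total, count = 0, 0
for line in f:
    total += int(line.strip().split('\t')[ratingIndex])
    count += 1

total / count  # average rating, computed in a single pass
```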

Computing simple statistics from data

Let's quickly read the data again using the same code:

In [24]:
import gzip
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
f = gzip.open(path, 'rt')
In [25]:
dataset = []
# Read the header:
header = f.readline().strip().split('\t')
for line in f:
    # Separate by tabs
    line = line.split('\t')
    # Convert to key-value pairs
    d = dict(zip(header, line))
    # Convert strings to integers for some fields:
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)

By iterating through our dataset, we can straightforwardly compute some simple statistics, e.g. how many ratings are there?

In [26]:
nRatings = len(dataset)
nRatings
Out[26]:
149086

And what is the average rating?

In [27]:
average = 0
for d in dataset:
    average += d['star_rating']
average /= nRatings
average
Out[27]:
4.731363105858364

How many unique users and products are there in this dataset?

In [28]:
users = set()
items = set()
for d in dataset:
    users.add(d['customer_id'])
    items.add(d['product_id'])

len(users),len(items)
Out[28]:
(143181, 1780)

E.g., what is the average rating of a verified purchase, versus an unverified purchase?

In [29]:
avVerified = 0
avUnverified = 0
nVerified = 0
nUnverified = 0
for d in dataset:
    if d['verified_purchase'] == 'Y':
        avVerified += d['star_rating']
        nVerified += 1
    else:
        avUnverified += d['star_rating']
        nUnverified += 1

avVerified /= nVerified
avUnverified /= nUnverified
avVerified, avUnverified
Out[29]:
(4.7461078196439335, 4.577583563324134)

Many of these types of operations can be done more easily using operations known as "list comprehensions", which allow us to process and filter the data:

In [30]:
verifiedRatings = [d['star_rating'] for d in dataset if d['verified_purchase'] == 'Y']
unverifiedRatings = [d['star_rating'] for d in dataset if d['verified_purchase'] == 'N']
In [31]:
print(sum(verifiedRatings) * 1.0 / len(verifiedRatings))
print(sum(unverifiedRatings) * 1.0 / len(unverifiedRatings))
4.7461078196439335
4.577583563324134

Reading data from JSON

Another common data format is JSON (https://www.json.org/). This format generalizes key-value pairs (like those that we saw in the previous notebooks) by allowing the values to themselves be key-value pairs (allowing for hierarchical data).
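For instance, a record's value can itself contain keys and values; nested lookups then chain one index per level. A made-up record for illustration:

```python
# A made-up hierarchical record: the value of "hours" is itself a set of key-value pairs
record = {"name": "Dental by Design",
          "hours": {"Friday": "7:30-17:00", "Monday": "7:30-17:00"}}

# One index per level of nesting
record['hours']['Friday']  # '7:30-17:00'
```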

Let's look at an example of such data, this time from Yelp. This data is part of the "Yelp dataset challenge" and should first be downloaded locally before beginning this notebook: https://www.yelp.com/dataset/download

In [32]:
path = "/home/jmcauley/datasets/mooc/yelp_data/review.json"

This file is very large -- for the moment, let's just look at the first 50,000 lines:

In [33]:
f = open(path, 'r')
In [34]:
lines = []
for i in range(50000):
    lines.append(f.readline())

Let's just look at the first line:

In [35]:
lines[0]
Out[35]:
'{"review_id":"v0i_UHJMo_hPBq9bxWvW4w","user_id":"bv2nCi5Qv5vroFiqKGopiw","business_id":"0W4lkclzZThpx3V65bVgig","stars":5,"date":"2016-05-28","text":"Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \\n\\nThey ask you how you want you meat, lean or something maybe, I can\'t remember. Just say you don\'t want it too fatty. \\n\\nGet a half sour pickle and a hot pepper. Hand cut french fries too.","useful":0,"funny":0,"cool":0}\n'

Note that this looks very much like a python dictionary! In fact we could convert it directly to a python dictionary using the "eval" operator:

In [36]:
d = eval(lines[0])
d
Out[36]:
{'business_id': '0W4lkclzZThpx3V65bVgig',
 'cool': 0,
 'date': '2016-05-28',
 'funny': 0,
 'review_id': 'v0i_UHJMo_hPBq9bxWvW4w',
 'stars': 5,
 'text': "Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.",
 'useful': 0,
 'user_id': 'bv2nCi5Qv5vroFiqKGopiw'}

Then we could treat it just like a key-value pair:

In [37]:
d['user_id']
Out[37]:
'bv2nCi5Qv5vroFiqKGopiw'
In [38]:
d['stars']
Out[38]:
5

The "eval" operator isn't the safest though -- it's basically executing the line of the file as if it were native Python code. This is a dangerous thing to do, especially if we don't trust the source of the file we're using.

More safely, we can use the json library to read the data.

In [39]:
import json

and then read the data in the same way:

In [40]:
d = json.loads(lines[0])
d
Out[40]:
{'business_id': '0W4lkclzZThpx3V65bVgig',
 'cool': 0,
 'date': '2016-05-28',
 'funny': 0,
 'review_id': 'v0i_UHJMo_hPBq9bxWvW4w',
 'stars': 5,
 'text': "Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.",
 'useful': 0,
 'user_id': 'bv2nCi5Qv5vroFiqKGopiw'}
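To see the difference in safety, note that eval will execute arbitrary Python expressions, while json.loads accepts only literal JSON. A small illustration:

```python
import json

expr = "[1, 2] + [3]"  # a Python expression, not valid JSON

eval(expr)  # eval runs the code, producing [1, 2, 3]

# json.loads refuses anything that isn't a JSON literal
try:
    json.loads(expr)
except json.JSONDecodeError:
    print("not valid JSON")
```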

Let's look at a different dataset, also from the yelp challenge. This time let's read the business metadata:

In [41]:
path = "/home/jmcauley/datasets/mooc/yelp_data/business.json"
f = open(path, 'r')
In [42]:
dataset = []
for i in range(50000):
    dataset.append(json.loads(f.readline()))
In [43]:
dataset[0]
Out[43]:
{'address': '4855 E Warner Rd, Ste B9',
 'attributes': {'AcceptsInsurance': True,
  'BusinessAcceptsCreditCards': True,
  'ByAppointmentOnly': True},
 'business_id': 'FYWN1wneV18bWNgQjJ2GNg',
 'categories': ['Dentists',
  'General Dentistry',
  'Health & Medical',
  'Oral Surgeons',
  'Cosmetic Dentists',
  'Orthodontists'],
 'city': 'Ahwatukee',
 'hours': {'Friday': '7:30-17:00',
  'Monday': '7:30-17:00',
  'Thursday': '7:30-17:00',
  'Tuesday': '7:30-17:00',
  'Wednesday': '7:30-17:00'},
 'is_open': 1,
 'latitude': 33.3306902,
 'longitude': -111.9785992,
 'name': 'Dental by Design',
 'neighborhood': '',
 'postal_code': '85044',
 'review_count': 22,
 'stars': 4.0,
 'state': 'AZ'}

Again, each entry is a set of key-value pairs, but note that some of the values are themselves key-value pairs:

In [44]:
dataset[0]['attributes']
Out[44]:
{'AcceptsInsurance': True,
 'BusinessAcceptsCreditCards': True,
 'ByAppointmentOnly': True}

Note also that (at least in this dataset) numerical values are already formatted as ints/floats, and don't need to be converted:

In [45]:
dataset[0]['stars']
Out[45]:
4.0

Time and date data

In [46]:
import json
path = "/home/jmcauley/datasets/mooc/yelp_data/review.json"
f = open(path, 'r')
In [47]:
dataset = []
for i in range(50000):
    dataset.append(json.loads(f.readline()))
In [48]:
dataset[0]
Out[48]:
{'business_id': '0W4lkclzZThpx3V65bVgig',
 'cool': 0,
 'date': '2016-05-28',
 'funny': 0,
 'review_id': 'v0i_UHJMo_hPBq9bxWvW4w',
 'stars': 5,
 'text': "Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.",
 'useful': 0,
 'user_id': 'bv2nCi5Qv5vroFiqKGopiw'}

Let's look at the first review's date:

In [49]:
timeString = dataset[0]['date']
print(timeString)
2016-05-28

To handle the string-formatted time data, we can use python's "time" library:

In [50]:
import time
In [51]:
timeStruct = time.strptime(timeString, "%Y-%m-%d")
timeStruct
Out[51]:
time.struct_time(tm_year=2016, tm_mon=5, tm_mday=28, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=149, tm_isdst=-1)

The above operation produced a data structure representing the time string. Note that we had to specify a format: here, a four-digit year and a two-digit month and day, separated by hyphens. Details about format strings can be found at https://docs.python.org/2/library/datetime.html. E.g., we could also have a string consisting of a time and a date:

In [52]:
time.strptime("21:36:18, 28/5/2019", "%H:%M:%S, %d/%m/%Y")
Out[52]:
time.struct_time(tm_year=2019, tm_mon=5, tm_mday=28, tm_hour=21, tm_min=36, tm_sec=18, tm_wday=1, tm_yday=148, tm_isdst=-1)

The above time structure allows us to extract specific information about the date/timestamp, but it may be easier yet to convert the timestamps into integers for easy comparison:

In [53]:
timeInt = time.mktime(timeStruct)
timeInt
Out[53]:
1464418800.0

The exact value of this time is the number of seconds since January 1, 1970 (the Unix "epoch"). While not the most obvious format, it does allow for easy comparison between times:

In [54]:
timeInt2 = time.mktime(time.strptime(dataset[99]['date'], "%Y-%m-%d"))

E.g. the number of seconds between dataset[0] and dataset[99] is...

In [55]:
timeDiff = timeInt - timeInt2
timeDiff
Out[55]:
125712000.0
In [56]:
timeDiff / 60 # minutes
Out[56]:
2095200.0
In [57]:
timeDiff / (60*60) # hours
Out[57]:
34920.0
In [58]:
timeDiff / (60*60*24) # days
Out[58]:
1455.0

The values can also be converted back to a structured time object:

In [59]:
time.gmtime(timeInt)
Out[59]:
time.struct_time(tm_year=2016, tm_mon=5, tm_mday=28, tm_hour=7, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=149, tm_isdst=0)

Or we can determine, for example, what the date would be one week later:

In [60]:
time.gmtime(timeInt + 60*60*24*7)
Out[60]:
time.struct_time(tm_year=2016, tm_mon=6, tm_mday=4, tm_hour=7, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=156, tm_isdst=0)
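As an aside, Python's datetime library wraps these conversions at a higher level; e.g., the one-week offset above can be written with a timedelta rather than counting seconds:

```python
from datetime import datetime, timedelta

# Parse the same date with the same format string as before
d = datetime.strptime("2016-05-28", "%Y-%m-%d")

# timedelta handles the calendar arithmetic for us
oneWeekLater = d + timedelta(days=7)

oneWeekLater.strftime("%Y-%m-%d")  # '2016-06-04'
```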

Finally, it might be useful to augment our dataset to include these numerical time measurements for every sample:

In [61]:
datasetWithTimeValues = []
In [62]:
for d in dataset:
    d['timeStruct'] = time.strptime(d['date'], "%Y-%m-%d")
    d['timeInt'] = time.mktime(d['timeStruct'])
    datasetWithTimeValues.append(d)

The same strategy can also be used to deal with times that have no date attached to them, e.g. let's look at the length of a business's opening hours:

In [63]:
path = "/home/jmcauley/datasets/mooc/yelp_data/business.json"
f = open(path, 'r')
dataset = []
for i in range(50000):
    dataset.append(json.loads(f.readline()))
In [64]:
dataset[0]['hours']
Out[64]:
{'Friday': '7:30-17:00',
 'Monday': '7:30-17:00',
 'Thursday': '7:30-17:00',
 'Tuesday': '7:30-17:00',
 'Wednesday': '7:30-17:00'}

Let's try to calculate how long the business is open on Fridays:

In [65]:
hours = dataset[0]['hours']['Friday']
hours
Out[65]:
'7:30-17:00'
In [66]:
openTime,closeTime = hours.split('-')
openTime,closeTime
Out[66]:
('7:30', '17:00')
In [67]:
timeIntOpen = time.mktime(time.strptime(openTime, "%H:%M"))
timeIntClose = time.mktime(time.strptime(closeTime, "%H:%M"))
In [68]:
timeIntOpen
Out[68]:
-2208933000.0
In [69]:
timeIntClose
Out[69]:
-2208898800.0

Note that since we specified no date, these timestamps are assumed to correspond to January 1, 1900:

In [70]:
time.gmtime(timeIntOpen)
Out[70]:
time.struct_time(tm_year=1900, tm_mon=1, tm_mday=1, tm_hour=15, tm_min=30, tm_sec=0, tm_wday=0, tm_yday=1, tm_isdst=0)

However for the purposes of measuring the opening duration, it's sufficient to compute the difference:

In [71]:
(timeIntClose - timeIntOpen) / (60*60)
Out[71]:
9.5
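Since no calendar arithmetic is actually needed here, the same duration could also be computed by hand; hoursToMinutes below is a hypothetical helper, not part of any library:

```python
def hoursToMinutes(hhmm):
    # '7:30' -> 450 (minutes past midnight)
    h, m = hhmm.split(':')
    return int(h) * 60 + int(m)

openTime, closeTime = '7:30-17:00'.split('-')
(hoursToMinutes(closeTime) - hoursToMinutes(openTime)) / 60  # 9.5 hours
```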

Using pandas for Data Analysis

Case Study: Amazon Dataset

We'll again use the "Gift card" dataset. Please unzip the file after downloading it, and place amazon_reviews_us_Gift_Card_v1_00.tsv under the datasets directory.

Objective: We seek to draw inferences from this dataset, whilst exploring some of the functionalities pandas has to offer.

In [1]:
import pandas as pd
giftcard = pd.read_csv('./datasets/amazon_reviews_us_Gift_Card_v1_00.tsv', sep='\t')
print(type(giftcard))
giftcard.head()
<class 'pandas.core.frame.DataFrame'>
Out[1]:
marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date
0 US 24371595 R27ZP1F1CD0C3Y B004LLIL5A 346014806 Amazon eGift Card - Celebrate Gift Card 5 0 0 N Y Five Stars Great birthday gift for a young adult. 2015-08-31
1 US 42489718 RJ7RSBCHUDNNE B004LLIKVU 473048287 Amazon.com eGift Cards Gift Card 5 0 0 N Y Gift card for the greatest selection of items ... It's an Amazon gift card and with over 9823983... 2015-08-31
2 US 861463 R1HVYBSKLQJI5S B00IX1I3G6 926539283 Amazon.com Gift Card Balance Reload Gift Card 5 0 0 N Y Five Stars Good 2015-08-31
3 US 25283295 R2HAXF0IIYQBIR B00IX1I3G6 926539283 Amazon.com Gift Card Balance Reload Gift Card 1 0 0 N Y One Star Fair 2015-08-31
4 US 397970 RNYLPX611NB7Q B005ESMGV4 379368939 Amazon.com Gift Cards, Pack of 3 (Various Desi... Gift Card 5 0 0 N Y Five Stars I can't believe how quickly Amazon can get the... 2015-08-31
In [2]:
giftcard.describe()
Out[2]:
customer_id product_parent star_rating helpful_votes total_votes
count 1.483100e+05 1.483100e+05 148310.000000 148310.000000 148310.000000
mean 2.628931e+07 5.406163e+08 4.731333 0.397424 0.490493
std 1.587236e+07 2.661563e+08 0.829255 20.701385 22.823494
min 1.063700e+04 1.100879e+06 1.000000 0.000000 0.000000
25% 1.289732e+07 3.612555e+08 5.000000 0.000000 0.000000
50% 2.499530e+07 4.730483e+08 5.000000 0.000000 0.000000
75% 4.139731e+07 7.754865e+08 5.000000 0.000000 0.000000
max 5.309648e+07 9.992742e+08 5.000000 5987.000000 6323.000000
In [3]:
giftcard['star_rating'].max()
Out[3]:
5
In [4]:
giftcard['star_rating'].min()
Out[4]:
1
In [5]:
giftcard['star_rating'].mean()
Out[5]:
4.731333018677096
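The verified-versus-unverified comparison from earlier can be written as a one-line groupby. A sketch on a tiny hand-made DataFrame, standing in for the full giftcard frame:

```python
import pandas as pd

# A tiny stand-in for the giftcard DataFrame
df = pd.DataFrame({
    'verified_purchase': ['Y', 'Y', 'N', 'Y', 'N'],
    'star_rating':       [5, 4, 5, 5, 2],
})

# Mean star rating for each value of verified_purchase
df.groupby('verified_purchase')['star_rating'].mean()
```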
In [6]:
giftcard.isnull().any()
Out[6]:
marketplace          False
customer_id          False
review_id            False
product_id           False
product_parent       False
product_title        False
product_category     False
star_rating          False
helpful_votes        False
total_votes          False
vine                 False
verified_purchase    False
review_headline       True
review_body           True
review_date           True
dtype: bool
In [7]:
giftcard.shape
Out[7]:
(148310, 15)
In [8]:
giftcard = giftcard.dropna()
In [9]:
giftcard.shape
Out[9]:
(148304, 15)
In [10]:
df = giftcard[['star_rating','helpful_votes','total_votes']]
df
Out[10]:
star_rating helpful_votes total_votes
0 5 0 0
1 5 0 0
2 5 0 0
3 1 0 0
4 5 0 0
5 5 0 0
6 5 0 0
7 5 0 0
8 1 0 0
9 5 0 0
10 5 0 0
11 5 1 1
12 5 1 1
13 5 0 0
14 5 1 1
15 5 0 0
16 5 0 0
17 4 0 0
18 5 0 0
19 5 1 1
20 4 1 1
21 5 0 0
22 5 0 0
23 5 0 0
24 5 0 0
25 5 0 0
26 5 0 0
27 5 0 0
28 5 0 0
29 5 0 0
... ... ... ...
148280 3 8 71
148281 5 42 47
148282 5 6 6
148283 3 1 9
148284 1 4 20
148285 5 4 9
148286 1 10 30
148287 5 6 8
148288 5 8 10
148289 1 34 54
148290 5 5 7
148291 4 10 11
148292 1 22 28
148293 5 4 6
148294 4 18 27
148295 5 21 22
148296 1 10 29
148297 2 2 38
148298 5 2 7
148299 5 11 12
148300 1 9 32
148301 5 105 107
148302 1 10 32
148303 1 6 17
148304 4 3 9
148305 5 10 10
148306 4 8 44
148307 5 20 30
148308 4 63 72
148309 5 26 29

148304 rows × 3 columns

In [11]:
df.head(100).plot.bar()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x10947a3c8>
In [12]:
df.head(100).plot.hist()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x10afed5c0>
In [13]:
df.head(100).plot()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x10de32f98>
In [14]:
df_helpful_votes = df[df['helpful_votes'] > 0]
print(df.shape)
print(df_helpful_votes.shape)
(148304, 3)
(5548, 3)
In [15]:
df_helpful_votes
df_helpful_votes.groupby('total_votes').mean()
Out[15]:
star_rating helpful_votes
total_votes
1 4.066760 1.000000
2 3.463059 1.565036
3 3.106494 2.174026
4 2.843750 2.897321
5 2.602339 3.666667
6 2.184783 4.521739
7 2.250000 5.062500
8 1.760563 5.408451
9 2.204082 6.795918
10 2.260870 7.282609
11 2.000000 7.647059
12 2.387097 8.935484
13 1.952381 9.142857
14 1.961538 11.461538
15 2.400000 11.450000
16 1.666667 12.888889
17 1.769231 10.923077
18 2.173913 14.521739
19 2.647059 14.117647
20 2.052632 14.473684
21 2.111111 14.555556
22 2.133333 15.666667
23 1.812500 18.125000
24 2.777778 20.000000
25 1.750000 19.375000
26 2.428571 20.714286
27 2.500000 16.500000
28 2.928571 21.785714
29 1.545455 17.636364
30 2.000000 21.600000
... ... ...
354 1.000000 325.000000
367 3.000000 341.000000
369 1.000000 305.000000
436 3.000000 359.000000
441 1.000000 404.000000
443 5.000000 353.000000
454 1.000000 404.000000
460 4.000000 367.000000
466 2.000000 379.000000
515 1.000000 422.000000
545 1.000000 452.000000
549 1.000000 441.000000
557 1.000000 501.000000
615 2.000000 531.000000
633 1.000000 572.000000
688 1.000000 576.000000
744 1.000000 627.000000
751 5.000000 629.000000
778 1.000000 629.000000
827 5.000000 694.000000
1028 1.000000 875.000000
1038 1.000000 948.000000
1111 2.000000 902.000000
1113 5.000000 892.000000
1303 1.000000 1208.000000
1433 1.000000 1202.000000
2253 5.000000 2004.000000
2557 2.000000 2231.000000
2763 4.000000 2383.000000
6323 1.000000 5987.000000

181 rows × 2 columns

In [16]:
df_helpful_votes[['helpful_votes','total_votes']].head(50).plot.bar()
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dfaa438>
In [17]:
%matplotlib inline
giftcard.hist(column='star_rating', figsize=(15,10))
Out[17]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10e59a048>]],
      dtype=object)

Plotting with matplotlib

Let's do a simple plot of ratings as a function of the day of the week.

First we'll import the matplotlib library:

In [72]:
from matplotlib import pyplot as plt
from collections import defaultdict
In [73]:
weekRatings = defaultdict(list)
In [74]:
for d in datasetWithTimeValues:
    day = d['timeStruct'].tm_wday
    weekRatings[day].append(d['stars'])
In [75]:
weekAverages = {}
In [76]:
for d in weekRatings:
    weekAverages[d] = sum(weekRatings[d]) * 1.0 / len(weekRatings[d])
In [77]:
weekAverages
Out[77]:
{0: 3.7094594594594597,
 1: 3.715375187253166,
 2: 3.750551876379691,
 3: 3.763665361751486,
 4: 3.7551891653172382,
 5: 3.7231843981953134,
 6: 3.7072147651006713}
In [78]:
X = list(weekAverages.keys())
In [79]:
Y = [weekAverages[x] for x in X]
In [80]:
plt.plot(X, Y)
Out[80]:
[<matplotlib.lines.Line2D at 0x7fd3ad4f2828>]

It might be nicer to plot this as a bar plot:

In [81]:
plt.bar(X, Y)
Out[81]:
<Container object of 7 artists>

Let's zoom in to make the differences a bit more visible:

In [82]:
plt.ylim(3.6, 3.8)
plt.bar(X, Y)
Out[82]:
<Container object of 7 artists>

Next let's add some labels:

In [83]:
plt.ylim(3.6, 3.8)
plt.xlabel("Weekday")
plt.ylabel("Rating")
plt.title("Rating as a function of weekday")
plt.bar(X, Y)
Out[83]:
<Container object of 7 artists>

Finally, let's rename the ticks to correspond to the days of the week:

In [84]:
plt.ylim(3.6, 3.8)
plt.xlabel("Weekday")
plt.ylabel("Rating")
plt.xticks([0,1,2,3,4,5,6],['M', 'T', 'W', 'T', 'F', 'S', 'S']) # tm_wday counts from Monday = 0
plt.title("Rating as a function of weekday")
plt.bar(X, Y)
Out[84]:
<Container object of 7 artists>