Amazon Review Data (2018)
Jianmo Ni, UCSD
Please see the 2023 version of this dataset
This (older) version is mainly here for the sake of reproducing past results
Description
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
- More reviews:
- The total number of reviews is 233.1 million (142.8 million in 2014).
- Newer reviews:
- Current data includes reviews in the range May 1996 - Oct 2018.
- Metadata:
- We have added transaction metadata for each review shown on the review page. Such information includes:
- Product information, e.g. color (white or black), size (large or small), package type (hardcover or electronics), etc.
- Product images that are taken after the user received the product.
- We have added more detailed metadata from the product landing page. Such detailed information includes:
- Bullet-point descriptions under product title.
- Technical details table (attribute-value pairs).
- Similar products table.
- More categories:
- Includes 5 new product categories.
You can also download the review data from our previous datasets.
Citation
Please cite the following paper if you use the data in any way:
Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019
pdf
News
05/2021 We added high-resolution image URLs to the metadata!
08/2020 We have updated the metadata and now it includes much less HTML/CSS code. Feel free to download the updated data!
Note
We provide a colab notebook that helps you parse and clean the data. For example:
- Load the metadata (e.g. as JSON or DataFrame)
- Check if title has HTML contents and filter them
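The title check in the list above can be sketched in plain Python. This is not the notebook's code: the regular expression and the helper names `looks_like_html` and `filter_clean_titles` are assumptions made for illustration, and the sample records are invented.

```python
import re

# Heuristic (an assumption, not the notebook's rule): a title is suspect if it
# contains something shaped like an HTML tag or a CSS/JS-looking brace block.
_HTML_PATTERN = re.compile(r"<[a-zA-Z/!][^>]*>|\{[^}]*\}")

def looks_like_html(title):
    """Return True if the title appears to contain leftover HTML/CSS."""
    return bool(_HTML_PATTERN.search(title or ""))

def filter_clean_titles(products):
    """Keep only metadata records whose title looks like plain text."""
    return [p for p in products if not looks_like_html(p.get("title"))]

# Invented records for demonstration only.
products = [
    {"asin": "X1", "title": "Stainless Steel Water Bottle"},
    {"asin": "X2", "title": "<span class=\"a-size-medium\">Bottle</span>"},
    {"asin": "X3", "title": "var aPageStart = {getTime: 1};"},
]
clean = filter_clean_titles(products)
print([p["asin"] for p in clean])
```

A pattern this simple will also flag legitimate titles that happen to contain braces, so it is best treated as a first pass before manual inspection.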
We provide a colab notebook that helps you find target products and obtain their reviews!
We appreciate any help or feedback to improve the quality of our dataset! Feel free to reach us at jin018@ucsd.edu if you encounter any of the following issues:
- Unparsed HTML contents
- Duplicate items that have the same reviews
Directory
Files
Complete review data
Please only download these (large!) files if you really need them. We recommend using the smaller datasets (i.e. k-core and CSV files) as shown in the next section.
raw review data (34 GB) - all 233.1 million reviews
ratings only (6.7 GB) - same as above, in CSV form without reviews or metadata
5-core (14.3 GB) - subset of the data in which all users and items have at least 5 reviews (75.26 million reviews)
metadata (12 GB) - metadata for all products
- We also provide a colab notebook that helps you parse and clean the data.
Per-category data - the review and product metadata for each category:
Amazon Fashion | reviews (883,636 reviews) | metadata (186,637 products) |
All Beauty | reviews (371,345 reviews) | metadata (32,992 products) |
Appliances | reviews (602,777 reviews) | metadata (30,459 products) |
Arts Crafts and Sewing | reviews (2,875,917 reviews) | metadata (303,426 products) |
Automotive | reviews (7,990,166 reviews) | metadata (932,019 products) |
Books | reviews (51,311,621 reviews) | metadata (2,935,525 products) |
CDs and Vinyl | reviews (4,543,369 reviews) | metadata (544,442 products) |
Cell Phones and Accessories | reviews (10,063,255 reviews) | metadata (590,269 products) |
Clothing Shoes and Jewelry | reviews (32,292,099 reviews) | metadata (2,685,059 products) |
Digital Music | reviews (1,584,082 reviews) | metadata (465,392 products) |
Electronics | reviews (20,994,353 reviews) | metadata (786,868 products) |
Gift Cards | reviews (147,194 reviews) | metadata (1,548 products) |
Grocery and Gourmet Food | reviews (5,074,160 reviews) | metadata (287,209 products) |
Home and Kitchen | reviews (21,928,568 reviews) | metadata (1,301,225 products) |
Industrial and Scientific | reviews (1,758,333 reviews) | metadata (167,524 products) |
Kindle Store | reviews (5,722,988 reviews) | metadata (493,859 products) |
Luxury Beauty | reviews (574,628 reviews) | metadata (12,308 products) |
Magazine Subscriptions | reviews (89,689 reviews) | metadata (3,493 products) |
Movies and TV | reviews (8,765,568 reviews) | metadata (203,970 products) |
Musical Instruments | reviews (1,512,530 reviews) | metadata (120,400 products) |
Office Products | reviews (5,581,313 reviews) | metadata (315,644 products) |
Patio Lawn and Garden | reviews (5,236,058 reviews) | metadata (279,697 products) |
Pet Supplies | reviews (6,542,483 reviews) | metadata (206,141 products) |
Prime Pantry | reviews (471,614 reviews) | metadata (10,815 products) |
Software | reviews (459,436 reviews) | metadata (26,815 products) |
Sports and Outdoors | reviews (12,980,837 reviews) | metadata (962,876 products) |
Tools and Home Improvement | reviews (9,015,203 reviews) | metadata (571,982 products) |
Toys and Games | reviews (8,201,231 reviews) | metadata (634,414 products) |
Video Games | reviews (2,565,349 reviews) | metadata (84,893 products) |
"Small" subsets for experimentation
If you're using this data for a class project (or similar) please consider using one of these smaller datasets below before requesting the larger files.
K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.
Ratings only: These datasets include no metadata or reviews, but only (item,user,rating,timestamp) tuples. Thus they are suitable for use with mymedialite (or similar) packages.
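The k-core reduction described above can be sketched on (item,user,rating,timestamp) tuples as follows. This is a minimal illustration, not the script used to build the released files; the function name `k_core` and the toy data are assumptions.

```python
from collections import Counter

def k_core(ratings, k=5):
    """Iteratively drop users and items with fewer than k ratings.

    `ratings` is a list of (item, user, rating, timestamp) tuples.
    Removing one row can push another user or item below k, so the
    filter repeats until nothing changes.
    """
    rows = list(ratings)
    while True:
        item_counts = Counter(r[0] for r in rows)
        user_counts = Counter(r[1] for r in rows)
        kept = [r for r in rows
                if item_counts[r[0]] >= k and user_counts[r[1]] >= k]
        if len(kept) == len(rows):
            return kept
        rows = kept

# Invented toy data: users u1 and u2 each rate items i1 and i2; u3 rates only i1.
toy = [
    ("i1", "u1", 5.0, 1514764800),
    ("i2", "u1", 4.0, 1514764801),
    ("i1", "u2", 3.0, 1514764802),
    ("i2", "u2", 2.0, 1514764803),
    ("i1", "u3", 1.0, 1514764804),
]
core = k_core(toy, k=2)
print(len(core))
```

The loop matters: a single pass is not enough, because dropping a sparse user can leave an item with too few ratings, and vice versa.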
You can directly download the following smaller per-category datasets.
Amazon Fashion | 5-core (3,176 reviews) | ratings only (883,636 ratings) |
All Beauty | 5-core (5,269 reviews) | ratings only (371,345 ratings) |
Appliances | 5-core (2,277 reviews) | ratings only (602,777 ratings) |
Arts, Crafts and Sewing | 5-core (494,485 reviews) | ratings only (2,875,917 ratings) |
Automotive | 5-core (1,711,519 reviews) | ratings only (7,990,166 ratings) |
Books | 5-core (27,164,983 reviews) | ratings only (51,311,621 ratings) |
CDs and Vinyl | 5-core (1,443,755 reviews) | ratings only (4,543,369 ratings) |
Cell Phones and Accessories | 5-core (1,128,437 reviews) | ratings only (10,063,255 ratings) |
Clothing, Shoes and Jewelry | 5-core (11,285,464 reviews) | ratings only (32,292,099 ratings) |
Digital Music | 5-core (169,781 reviews) | ratings only (1,584,082 ratings) |
Electronics | 5-core (6,739,590 reviews) | ratings only (20,994,353 ratings) |
Gift Cards | 5-core (2,972 reviews) | ratings only (147,194 ratings) |
Grocery and Gourmet Food | 5-core (1,143,860 reviews) | ratings only (5,074,160 ratings) |
Home and Kitchen | 5-core (6,898,955 reviews) | ratings only (21,928,568 ratings) |
Industrial and Scientific | 5-core (77,071 reviews) | ratings only (1,758,333 ratings) |
Kindle Store | 5-core (2,222,983 reviews) | ratings only (5,722,988 ratings) |
Luxury Beauty | 5-core (34,278 reviews) | ratings only (574,628 ratings) |
Magazine Subscriptions | 5-core (2,375 reviews) | ratings only (89,689 ratings) |
Movies and TV | 5-core (3,410,019 reviews) | ratings only (8,765,568 ratings) |
Musical Instruments | 5-core (231,392 reviews) | ratings only (1,512,530 ratings) |
Office Products | 5-core (800,357 reviews) | ratings only (5,581,313 ratings) |
Patio, Lawn and Garden | 5-core (798,415 reviews) | ratings only (5,236,058 ratings) |
Pet Supplies | 5-core (2,098,325 reviews) | ratings only (6,542,483 ratings) |
Prime Pantry | 5-core (137,788 reviews) | ratings only (471,614 ratings) |
Software | 5-core (12,805 reviews) | ratings only (459,436 ratings) |
Sports and Outdoors | 5-core (2,839,940 reviews) | ratings only (12,980,837 ratings) |
Tools and Home Improvement | 5-core (2,070,831 reviews) | ratings only (9,015,203 ratings) |
Toys and Games | 5-core (1,828,971 reviews) | ratings only (8,201,231 ratings) |
Video Games | 5-core (497,577 reviews) | ratings only (2,565,349 ratings) |
Data format
The format is one review per line, in JSON. See the examples below for further help reading the data.
Sample review:
where
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- vote - helpful votes of the review
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
- image - images that users post after they have received the product
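To make the field list concrete, here is a minimal sketch that parses one line in this format. The record is made up for illustration (only the example IDs and the "Format"/"Hardcover" pair come from the list above). It also assumes that `vote`, `style`, and `image` may be missing from some reviews and that `vote` may be stored as a string.

```python
import json
from datetime import datetime, timezone

# A made-up line in the one-review-per-line format; values are illustrative.
line = ('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
        '"reviewerName": "J. Doe", "vote": "2", '
        '"style": {"Format": "Hardcover"}, "reviewText": "Great book.", '
        '"overall": 5.0, "summary": "Loved it", '
        '"unixReviewTime": 1514764800, "reviewTime": "01 1, 2018"}')

review = json.loads(line)

# Assumption: optional fields can be absent, so read them with .get().
# Assumption: vote may be a string, possibly with thousands separators.
votes = int(str(review.get("vote", "0")).replace(",", ""))
images = review.get("image", [])

# unixReviewTime is seconds since the epoch; convert it to a UTC datetime.
when = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)

print(review["asin"], review["overall"], votes, len(images), when.date())
```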
Metadata
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (24 GB) - metadata for 15.5 million products
Sample metadata:
where
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- feature - bullet-point format features of the product
- description - description of the product
- price - price in US dollars (at time of crawl)
- imageURL - url of the product image
- imageURLHighRes - url of the high resolution product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
- tech1 - the first technical detail table of the product
- tech2 - the second technical detail table of the product
- similar - similar product table
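Reviews and metadata share the `asin` key, so product information can be attached to a review with a dictionary lookup. The sketch below uses invented records and a hypothetical helper `title_for`; only the `asin` and `title` keys are taken from the field lists.

```python
# Invented records; real files contain many more fields per record.
metadata = [
    {"asin": "0000031852", "title": "Example Product A", "brand": "ExampleBrand"},
    {"asin": "0000013714", "title": "Example Product B"},
]
reviews = [
    {"asin": "0000013714", "overall": 5.0},
    {"asin": "0000031852", "overall": 3.0},
    {"asin": "9999999999", "overall": 1.0},  # no matching metadata
]

# Index metadata by product ID for constant-time lookups.
meta_by_asin = {m["asin"]: m for m in metadata}

def title_for(review, default="(unknown product)"):
    """Look up the product title for a review, tolerating missing metadata."""
    return meta_by_asin.get(review["asin"], {}).get("title", default)

titled = [(title_for(r), r["overall"]) for r in reviews]
print(titled)
```

For the full metadata file an in-memory dictionary may be too large; in that case it may be preferable to index only the products that appear in the reviews of interest.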
Code
Reading the data
Data can be treated as Python dictionary objects. A simple script to read any of the above data is as follows:
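A minimal sketch of such a reader, assuming the files are gzip-compressed with one JSON object per line. The demo file name `demo_reviews.json.gz` and its two records are invented so that the example runs without downloading anything.

```python
import gzip
import json
import os
import tempfile

def parse(path):
    """Yield one dict per line from a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Demo on a tiny invented file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"asin": "0000013714", "overall": 5.0}) + "\n")
    f.write(json.dumps({"asin": "0000013714", "overall": 3.0}) + "\n")

records = list(parse(demo_path))
print(len(records), records[0]["asin"])
```

Because `parse` is a generator, it can stream the large files one record at a time instead of loading them into memory.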
Pandas data frame
This code reads the data into a pandas data frame:
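A minimal sketch, assuming pandas is installed and the file is gzip-compressed JSON lines. The helper name `get_df` and the demo file are illustrative, not part of the dataset.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

def parse(path):
    """Yield one dict per line from a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def get_df(path):
    """Load every record into a DataFrame, one row per record.

    Keys missing from a record become NaN in that row.
    """
    return pd.DataFrame(list(parse(path)))

# Demo on a tiny invented file; the second record has no "vote" key.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"asin": "0000013714", "overall": 5.0, "vote": "2"}) + "\n")
    f.write(json.dumps({"asin": "0000031852", "overall": 3.0}) + "\n")

df = get_df(demo_path)
print(df.shape)
```

This materializes the whole file in memory, so it is better suited to the per-category or 5-core files than to the complete review data.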
Example: compute average rating
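A minimal sketch that streams a review file and averages the `overall` field. The demo file and its three ratings are invented; for real data, `demo_path` would be replaced with the path to a downloaded review file.

```python
import gzip
import json
import os
import tempfile

def parse(path):
    """Yield one dict per line from a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Tiny invented file standing in for a real review file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for rating in (5.0, 3.0, 4.0):
        f.write(json.dumps({"asin": "0000013714", "overall": rating}) + "\n")

# Keep a running sum and count so the whole file never sits in memory.
total, count = 0.0, 0
for review in parse(demo_path):
    total += review["overall"]
    count += 1

average = total / count if count else float("nan")
print(average)
```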
Example: latent-factor model in mymedialite
Predicts ratings from a rating-only CSV file
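The MyMediaLite command line is not reproduced here. As a stand-in, the following pure-Python sketch fits a biased latent-factor model by stochastic gradient descent on rows in the (item,user,rating,timestamp) layout. It is not MyMediaLite's implementation; the function names, hyperparameters, and toy ratings are all assumptions.

```python
import csv
import io
import math
import random

def train_mf(rows, n_factors=2, lr=0.01, reg=0.02, epochs=200, seed=0):
    """Fit rating ~ mu + b_user + b_item + p_user . q_item by SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in rows) / len(rows)
    users = sorted({u for _, u, _ in rows})
    items = sorted({i for i, _, _ in rows})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    data = list(rows)
    for _ in range(epochs):
        rng.shuffle(data)
        for i, u, r in data:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # Gradient steps with L2 regularization on biases and factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)

    def predict(item, user):
        # Unseen users or items fall back to the global mean plus known biases.
        score = mu + bu.get(user, 0.0) + bi.get(item, 0.0)
        if user in p and item in q:
            score += sum(a * b for a, b in zip(p[user], q[item]))
        return min(5.0, max(1.0, score))  # clip to the 1-5 rating scale

    return predict, mu

def rmse(rows, predict_fn):
    """Root-mean-square error of predict_fn over (item, user, rating) rows."""
    return math.sqrt(sum((r - predict_fn(i, u)) ** 2 for i, u, r in rows) / len(rows))

# Invented ratings in the (item,user,rating,timestamp) CSV layout.
csv_text = """i1,u1,5.0,1514764800
i2,u1,5.0,1514764801
i3,u1,4.0,1514764802
i1,u2,1.0,1514764803
i2,u2,2.0,1514764804
i3,u2,1.0,1514764805
"""
rows = [(i, u, float(r)) for i, u, r, _ in csv.reader(io.StringIO(csv_text))]
predict, mu = train_mf(rows)
baseline = rmse(rows, lambda i, u: mu)  # always predict the global mean
fitted = rmse(rows, predict)
print(round(baseline, 3), fitted < baseline)
```

The comparison here is on training data only. A meaningful evaluation would hold out part of the ratings for testing.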