Recommender Systems Datasets

Julian McAuley, UCSD

Description

This page contains a collection of recommender systems datasets that have been used for research in my lab. Datasets contain the following features:

Please cite the appropriate reference if you use any of the datasets below.

Datasets are in (loose) JSON format unless specified otherwise: each line is a record that can be evaluated as a Python dictionary object. A simple script to read data in this format is as follows:

import gzip

def parse(path):
    # Each line holds one record written as a Python dictionary literal
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)
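Since eval will execute whatever expression a line contains, a more defensive reader may be preferable for files of uncertain origin. The sketch below is one possibility and not part of the original script: the names parse_line and parse_safe are hypothetical, and it assumes every line is either strict JSON or a plain Python literal.

```python
import ast
import gzip
import json

def parse_line(line):
    """Decode one record: try strict JSON first, then a Python literal."""
    try:
        return json.loads(line)
    except ValueError:
        # "Loose" json: single quotes, u'' prefixes, None/True/False.
        # literal_eval only accepts literals, so no code is executed.
        return ast.literal_eval(line)

def parse_safe(path):
    # Text mode, so each line is a str rather than bytes.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            line = line.strip()
            if line:
                yield parse_line(line)
```

Trying JSON first matters because strict JSON literals such as true are not valid Python, while records using None or single quotes are not valid JSON.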

Directory by Dataset

Amazon product reviews and metadata

Amazon question/answer data

Google Local business reviews and metadata

Steam video game reviews and bundles

Goodreads book reviews

ModCloth clothing fit feedback

RentTheRunway clothing fit feedback

Tradesy bartering data

RateBeer bartering data

Gameswap bartering data

Behance community art reviews and image features

Librarything reviews and social data

Epinions reviews and social data

Dance Dance Revolution step charts

NES song data

BeerAdvocate multi-aspect beer reviews

RateBeer multi-aspect beer reviews

Facebook social circles data

Twitter social circles data

Google+ social circles data

Reddit submission popularity and metadata

Directory by Metadata Type

The datasets below can be roughly organized in terms of the types of metadata they contain:

Review text: Amazon, BeerAdvocate, RateBeer, Google Local

Image data: Amazon, Behance

Item-to-item relationships: Amazon

Q/A data: Amazon Q/A

Geographical data: Google Local

Bundle data: Steam

Peer-to-peer trades: Tradesy, RateBeer, Gameswap

Social connections: Librarything, Epinions

Fit feedback: Modcloth, Renttherunway

Multiple aspects: BeerAdvocate, RateBeer



Amazon Product Reviews

Description

This is a large crawl of product reviews from Amazon. This dataset contains 82.83 million unique reviews, from around 20 million users.

Basic statistics

Ratings: 82.83 million
Users: 20.98 million
Items: 9.35 million
Timespan: May 1996 - July 2014

Metadata

Example

{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
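As a rough illustration of working with a record like the one above, the following sketch derives a calendar date and a helpfulness ratio. It assumes that "helpful" holds [helpful votes, total votes] and that "unixReviewTime" is in seconds since the epoch (UTC); neither is stated on this page, and the function name summarize_review is hypothetical.

```python
from datetime import datetime, timezone

def summarize_review(review):
    """Derive a few convenience fields from one review record."""
    # Assumption: "helpful" is [helpful votes, total votes].
    helpful_votes, total_votes = review["helpful"]
    return {
        "asin": review["asin"],
        "rating": review["overall"],
        # None when nobody voted, to avoid dividing by zero.
        "helpful_ratio": helpful_votes / total_votes if total_votes else None,
        # Assumption: the timestamp is seconds since the epoch, read as UTC.
        "date": datetime.fromtimestamp(
            review["unixReviewTime"], tz=timezone.utc
        ).date(),
    }

# Fields copied from the example record (review text omitted).
example = {
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "helpful": [2, 3],
    "overall": 5.0,
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009",
}
summary = summarize_review(example)
print(summary["date"], summary["helpful_ratio"])
```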

Download link

See the Amazon Dataset Page for download information.

Citation

Please cite the following if you use the data:

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
pdf

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
pdf



Amazon Question and Answer Data

Description

These datasets contain questions and answers about products from the Amazon dataset above.

Basic statistics

Questions: 1.48 million
Answers: 4,019,744
Labeled yes/no questions: 309,419
Number of unique products with questions: 191,185

Metadata

Example

{ "asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", "answerTime": "Aug 8, 2014", "unixTime": 1407481200, "question": "Can you use this unit with GEL shaving cans?", "answer": "Yes. If the can fits in the machine it will despense hot gel lather. I've been using my machine for both , gel and traditional lather for over 10 years." }
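To give a sense of how the labeled yes/no questions might be tallied, here is a minimal sketch that counts "answerType" values among records whose "questionType" is "yes/no". The function name is hypothetical, and the second sample record (including its "open-ended" type) is invented for illustration only.

```python
from collections import Counter

def yes_no_answer_counts(records):
    """Count answerType labels among questions marked as yes/no."""
    counts = Counter()
    for record in records:
        if record.get("questionType") == "yes/no":
            counts[record.get("answerType")] += 1
    return counts

records = [
    # Fields taken from the example above.
    {"asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y"},
    # Hypothetical record of a different type, for illustration only.
    {"asin": "B000050B6Z", "questionType": "open-ended"},
]
print(yes_no_answer_counts(records))
```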

Download link

See the Amazon Q/A Page for download information.

Citation

Please cite the following if you use the data:

Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems
Mengting Wan, Julian McAuley
International Conference on Data Mining (ICDM), 2016
pdf

Addressing complex and subjective product-related queries with customer reviews
Julian McAuley, Alex Yang
World Wide Web (WWW), 2016
pdf



Google Local Reviews

Description

These datasets contain reviews about businesses from Google Local (Google Maps). Data includes geographic information for each business as well as reviews.

Basic statistics

Reviews: 11,453,845
Users: 4,567,431
Businesses: 3,116,785

Metadata

Example (review)

{ 'rating': 3.0, 'reviewerName': u'an lam', 'reviewText': u'Ch\u1ea5t l\u01b0\u1ee3ng t\u1ea1m \u1ed5n', 'categories': [u'Gi\u1ea3i Tr\xed - Caf\xe9'], 'gPlusPlaceId': u'108103314380004200232', 'unixReviewTime': 1372686659, 'reviewTime': u'Jul 1, 2013', 'gPlusUserId': u'100000010817154263736' }

Example (business)

{ 'name': u'Diamond Valley Lake Marina', 'price': None, 'address': [u'2615 Angler Ave', u'Hemet, CA 92545'], 'hours': [[u'Monday', [[u'6:30 am--4:15 pm']]], [u'Tuesday', [[u'6:30 am--4:15 pm']]], [u'Wednesday', [[u'6:30 am--4:15 pm']], 1], [u'Thursday', [[u'6:30 am--4:15 pm']]], [u'Friday', [[u'6:30 am--4:15 pm']]], [u'Saturday', [[u'6:30 am--4:15 pm']]], [u'Sunday', [[u'6:30 am--4:15 pm']]]], 'phone': u'(951) 926-7201', 'closed': False, 'gPlusPlaceId': '104699454385822125632', 'gps': [33.703804, -117.003209] }
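The "gps" field makes distance-based queries possible. The sketch below computes the great-circle (haversine) distance between two such fields, assuming each is a [latitude, longitude] pair in degrees, which is consistent with the Hemet, CA example but not stated explicitly here. The function name and the second coordinate are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in km between two [lat, lon] pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

marina = [33.703804, -117.003209]   # 'gps' from the business example
elsewhere = [34.0, -117.0]          # hypothetical second location
print(round(haversine_km(marina, elsewhere), 1), "km")
```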

Download links

Places Data (276mb)

User Data (178mb)

Review Data (1.4gb)

Citation

Please cite the following if you use the data:

Translation-based factorization machines for sequential recommendation
Rajiv Pasricha, Julian McAuley
RecSys, 2018
pdf

Translation-based recommendation
Ruining He, Wang-Cheng Kang, Julian McAuley
RecSys, 2017
pdf



Steam Video Game and Bundle Data

Description

These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

Basic statistics

Reviews: 59,305
Purchases: 5,153,209
Users: 88,310
Items: 10,978
Bundles: 615

Metadata

Example (bundle)

{ 'bundle_final_price': '$29.66', 'bundle_url': 'http://store.steampowered.com/bundle/1482/?utm_source=SteamDB...', 'bundle_price': '$32.96', 'bundle_name': 'Two Tribes Complete Pack!', 'bundle_id': '1482', 'items': [{'genre': 'Casual, Indie', 'item_id': '38700', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38700', 'item_name': 'Toki Tori'}, {'genre': 'Adventure, Casual, Indie', 'item_id': '201420', 'discounted_price': '$14.99', 'item_url': 'http://store.steampowered.com/app/201420', 'item_name': 'Toki Tori 2+'}, {'genre': 'Strategy, Indie, Casual', 'item_id': '38720', 'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38720', 'item_name': 'RUSH'}, {'genre': 'Action, Indie', 'item_id': '38740', 'discounted_price': '$7.99', 'item_url': 'http://store.steampowered.com/app/38740', 'item_name': 'EDGE'}], 'bundle_discount': '10%' }
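Prices in the bundle records are strings, so they need converting before any arithmetic. The sketch below parses them with Decimal and recomputes the totals. In this one example the item prices sum to "bundle_price" and applying "bundle_discount" gives "bundle_final_price"; whether that holds for every bundle is an assumption. The helper names are hypothetical.

```python
from decimal import Decimal

def to_decimal(price):
    """Convert a price string such as '$4.99' to a Decimal."""
    return Decimal(price.lstrip("$"))

def bundle_totals(bundle):
    """Return (sum of item prices, that sum after the bundle discount)."""
    item_sum = sum(to_decimal(i["discounted_price"]) for i in bundle["items"])
    discount = Decimal(bundle["bundle_discount"].rstrip("%")) / 100
    return item_sum, item_sum * (1 - discount)

# Price fields copied from the example bundle (other fields omitted).
bundle = {
    "bundle_price": "$32.96",
    "bundle_final_price": "$29.66",
    "bundle_discount": "10%",
    "items": [
        {"item_name": "Toki Tori", "discounted_price": "$4.99"},
        {"item_name": "Toki Tori 2+", "discounted_price": "$14.99"},
        {"item_name": "RUSH", "discounted_price": "$4.99"},
        {"item_name": "EDGE", "discounted_price": "$7.99"},
    ],
}
item_sum, discounted = bundle_totals(bundle)
print(item_sum, discounted.quantize(Decimal("0.01")))
```

Decimal is used rather than float so that sums of prices compare exactly against the listed totals.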

Download links

Review Data (6.7mb)

User and Item Data (71mb)

Bundle Data (92kb)

Citation

Please cite the following if you use the data:

Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
pdf

Generating and personalizing bundle recommendations on Steam
Apurva Pathak, Kshitiz Gupta, Julian McAuley
SIGIR, 2017
pdf



Goodreads Book Reviews

These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a "shelf" to rating and reading.

Basic statistics

Items: 1,561,465
Users: 808,749
Interactions: 225,394,930

Metadata

Example (interaction data)

{ "user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580", "review_id": "330f9c153c8d3347eb914c06b89c94da", "isRead": true, "rating": 4, "date_added": "Mon Aug 01 13:41:57 -0700 2011", "date_updated": "Mon Aug 01 13:42:41 -0700 2011", "read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": "" }
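The date fields in the interaction records use a fixed textual layout, and some (such as "started_at" above) may be empty. The following sketch parses them into timezone-aware datetimes. The format string is inferred from this single example, the function name is hypothetical, and %a and %b assume English day and month names in the current locale.

```python
from datetime import datetime

# Inferred from "Mon Aug 01 13:41:57 -0700 2011".
GOODREADS_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_goodreads_time(value):
    """Return an aware datetime, or None for empty fields."""
    if not value:
        return None
    return datetime.strptime(value, GOODREADS_TIME_FORMAT)

added = parse_goodreads_time("Mon Aug 01 13:41:57 -0700 2011")
updated = parse_goodreads_time("Mon Aug 01 13:42:41 -0700 2011")
print((updated - added).total_seconds(), "seconds between add and update")
```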

Download links

See our project page for download links.

Citation

Please cite the following if you use the data:

Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
pdf



Clothing Fit Data

Description

These datasets contain measurements of clothing fit from ModCloth and RentTheRunway.

Basic statistics

                          Modcloth   Renttherunway
Number of users:          47,958     105,508
Number of items:          1,378      5,850
Number of transactions:   82,790     192,544

Metadata

Example (RentTheRunway)

{ "fit": "fit", "user_id": "420272", "bust size": "34d", "item_id": "2260466", "weight": "137lbs", "rating": "10", "rented for": "vacation", "review_text": "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.", "body type": "hourglass", "review_summary": "So many compliments!", "category": "romper", "height": "5' 8\"", "size": 14, "age": "28", "review_date": "April 20, 2016" }
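Several measurement fields in the example are stored as strings with units attached. A minimal sketch of converting them to numbers follows; the patterns are inferred from this one record (height like 5' 8" and weight like 137lbs), so other formats may appear in the data, and the function names are hypothetical.

```python
import re

def height_to_inches(height):
    """Convert a string such as 5' 8" to total inches, or None if unparseable."""
    match = re.fullmatch(r"\s*(\d+)'\s*(\d+)\"\s*", height)
    if match is None:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches

def weight_to_lbs(weight):
    """Convert a string such as '137lbs' to an integer, or None if unparseable."""
    match = re.fullmatch(r"\s*(\d+)\s*lbs\s*", weight)
    return int(match.group(1)) if match else None

record = {"height": "5' 8\"", "weight": "137lbs"}  # from the example above
print(height_to_inches(record["height"]), weight_to_lbs(record["weight"]))
```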

Download links

Modcloth (8.5mb)

Renttherunway (31mb)

Citation

Please cite the following if you use the data:

Decomposing fit semantics for product size recommendation in metric spaces
Rishabh Misra, Mengting Wan, Julian McAuley
RecSys, 2018
pdf



Product Exchange/Bartering Data

Description

These datasets contain peer-to-peer trades from various recommendation platforms.

Basic statistics

                          Tradesy   Ratebeer   Gameswap
Number of users:          128,152   2,215      9,888
Number of transactions:   68,543    125,665    3,470

Metadata

Example (tradesy)

{ 'lists': { 'bought': ['466', '459', '457', '449'], 'selling': [], 'want': [], 'sold': ['104', '103', '102'] }, 'uid': '2' }
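Many recommendation pipelines expect flat interaction tuples rather than nested per-user lists. The sketch below flattens one record of the form shown above into (uid, list name, item ID) triples; the function name is hypothetical.

```python
def iter_interactions(user):
    """Yield (uid, list_name, item_id) triples from one user record."""
    for list_name, items in user["lists"].items():
        for item_id in items:
            yield user["uid"], list_name, item_id

# The example record from above.
user = {
    'lists': {
        'bought': ['466', '459', '457', '449'],
        'selling': [],
        'want': [],
        'sold': ['104', '103', '102'],
    },
    'uid': '2',
}
interactions = list(iter_interactions(user))
print(len(interactions), interactions[0])
```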

Download links

Tradesy (3.8mb)

See the project page for ratebeer, gameswap (and other) datasets

Citation

Please cite the following if you use the data:

Bartering books to beers: A recommender system for exchange platforms
Jérémie Rappaz, Maria-Luiza Vladarean, Julian McAuley, Michele Catasta
WSDM, 2017
pdf

VBPR: Visual bayesian personalized ranking from implicit feedback
Ruining He, Julian McAuley
AAAI, 2016
pdf



Behance Community Art Data

Description

Likes and image data from the community art website Behance. This is a small, anonymized, version of a larger proprietary dataset.

Basic statistics

Users: 63,497
Items: 178,788
Appreciates ("likes"): 1,000,000

Metadata

Example ("appreciate" data)

Each entry is a user, item, timestamp triple:

276633 01588231 1307583271
1238354 01529213 1307583273
165550 00485000 1307583337
2173258 00776972 1307583340
165550 00158226 1307583406
1238354 01540285 1307583495
2459267 01578261 1307583509
165550 00264669 1307583518
165550 00171501 1307583536
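A short sketch of reading these triples and grouping them by user is given below. It assumes the file stores one whitespace-separated triple per line; the function name is hypothetical.

```python
from collections import defaultdict

def parse_appreciates(lines):
    """Group (item, timestamp) pairs by user from user/item/timestamp triples."""
    by_user = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        user, item, timestamp = fields
        # IDs stay as strings so leading zeros in item IDs are preserved.
        by_user[user].append((item, int(timestamp)))
    return by_user

sample = [
    "276633 01588231 1307583271",
    "1238354 01529213 1307583273",
    "165550 00485000 1307583337",
    "165550 00158226 1307583406",
]
appreciates = parse_appreciates(sample)
print(len(appreciates["165550"]), "appreciates by user 165550")
```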

Code to read image features

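import struct

def readImageFeatures(path):
    # Each record is an 8-byte item ID followed by 4096 4-byte floats
    f = open(path, 'rb')
    while True:
        itemId = f.read(8)
        if itemId == b'':  # end of file; read() returns bytes in binary mode
            break
        feature = struct.unpack('f'*4096, f.read(4*4096))
        yield itemId, feature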

Download links

See our Google Drive folder containing all Behance files. The folder also contains additional documentation.

Citation

Please cite the following if you use the data:

Vista: A visually, socially, and temporally-aware model for artistic recommendation
Ruining He, Chen Fang, Zhaowen Wang, Julian McAuley
RecSys, 2016
pdf



Social Recommendation Data

Description

These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews).

Basic statistics

                              Librarything   Epinions
Number of users:              73,882         41,554
Number of items:              337,561        112,991
Number of ratings/feedback:   979,053        181,394
Number of social relations:   120,536        181,304

Metadata

Example (LibraryThing reviews)

{ 'work': '3067', 'flags': [], 'unixtime': 1160265600, 'stars': 4.5, 'nhelpful': 0, 'time': 'Oct 8, 2006', 'comment': 'great storytelling in this novel about a couple crossed by a time travelling disorder ', 'user': 'justine' }

Example (LibraryThing social network)

Rodo anehan
Rodo sevilemar
Rodo dingsi
Rodo slash
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader Bumpersmom
RelaxedReader DivaColumbus
RelaxedReader AnnRig
RelaxedReader bookbroke
RelaxedReader BookWorm2729
RelaxedReader Bumpersmom
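The social network listing can be loaded into an adjacency structure along the following lines. This sketch assumes one whitespace-separated pair of user names per line and treats each pair as an edge from the first user to the second; whether edges are directed is not stated here, and the function name is hypothetical.

```python
from collections import defaultdict

def build_graph(lines):
    """Build an adjacency map from whitespace-separated user pairs."""
    graph = defaultdict(set)
    for line in lines:
        pair = line.split()
        if len(pair) != 2:
            continue  # skip blank or malformed lines
        source, target = pair
        # Using a set means repeated pairs are stored only once.
        graph[source].add(target)
    return graph

sample = [
    "Rodo anehan",
    "Rodo sevilemar",
    "RelaxedReader AnnRig",
    "RelaxedReader bookbroke",
    "RelaxedReader AnnRig",
]
graph = build_graph(sample)
print(sorted(graph["RelaxedReader"]))
```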

Download links

LibraryThing (594mb)

epinions (66mb)

Citation

Please cite the following if you use the data:

SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation
Chenwei Cai, Ruining He, Julian McAuley
IJCAI, 2017
pdf

Improving latent factor models via personalized feature projection for one-class recommendation
Tong Zhao, Julian McAuley, Irwin King
Conference on Information and Knowledge Management (CIKM), 2015
pdf



Older and Non-Recommender-Systems Datasets

Description

Below are older datasets, as well as datasets collected by my lab that are not related to recommender systems specifically. Formats of these datasets vary, so their respective project pages should be consulted for further details.



Video Game Data

Description

Step charts from the video game Dance Dance Revolution, and audio files from the NES platform.

Basic statistics

Num songs (DDR): 223 (7 hours)
Num charts (DDR): 1,102
Num games (NES): 397
Num songs (NES): 5,278 (46 hours)
Num notes (NES): 2,325,636

Download links

See the project pages for Dance Dance Convolution and NES MDB for further details and links to the data



Multi-aspect Reviews

Description

These datasets include reviews with multiple rated dimensions. The most comprehensive of these are beer review datasets from RateBeer and BeerAdvocate, which include sensory aspects such as taste, look, feel, and smell.

Basic statistics

                             Ratebeer              BeerAdvocate
Number of users:             40,213                33,387
Number of items:             110,419               66,051
Number of ratings/reviews:   110,419               1,586,259
Timespan:                    Apr 2000 - Nov 2011   Jan 1998 - Nov 2011

Metadata

Example (ratebeer)

beer/name: John Harvards Simcoe IPA
beer/beerId: 63836
beer/brewerId: 8481
beer/ABV: 5.4
beer/style: India Pale Ale (IPA)
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/profileName: hopdog
review/text: On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.
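Unlike the json-style datasets, this example uses "key: value" fields, with aspect scores on different scales (out of 5, 10, or 20). The sketch below parses such records and normalizes a score to the range 0 to 1. It assumes one field per line with a blank line between reviews, which is not stated on this page; the function names are hypothetical.

```python
def parse_reviews(lines):
    """Yield one dict per review from 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current review.
            if record:
                yield record
                record = {}
            continue
        # Split on the first ": " only, since review text may contain colons.
        key, _, value = line.partition(": ")
        record[key] = value
    if record:
        yield record

def score_fraction(score):
    """Convert a score such as '13/20' to a float between 0 and 1."""
    numerator, denominator = score.split("/")
    return int(numerator) / int(denominator)

sample = [
    "beer/name: John Harvards Simcoe IPA",
    "review/aroma: 6/10",
    "review/overall: 13/20",
    "",
]
reviews = list(parse_reviews(sample))
print(reviews[0]["beer/name"], score_fraction(reviews[0]["review/overall"]))
```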

Download links

See SNAP beeradvocate and ratebeer dataset pages

Citation

Please cite the following if you use the data:

Learning attitudes and attributes from multi-aspect reviews
Julian McAuley, Jure Leskovec, Dan Jurafsky
International Conference on Data Mining (ICDM), 2012
pdf

From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews
Julian McAuley, Jure Leskovec
WWW, 2013
pdf



Social Circles

Description

These datasets contain social connections and "circles" from Facebook, Twitter, and Google Plus.

Basic statistics

                      Facebook   Twitter   Google Plus
Number of networks:   10         133       1,000
Number of nodes:      4,039      106,674   192,075
Number of circles:    193        479       5,541

Metadata

Example (Kaggle egonet data)

UserId: Friends
1: 4 6 12 2 208
2: 5 3 17 90 7
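The egonet listing pairs each user ID with a list of friend IDs. A minimal sketch for reading it follows, assuming one "user: friends" entry per line after a header line; the function name is hypothetical.

```python
def parse_egonet(lines):
    """Map each user ID to the list of friend IDs given on the same line."""
    friends = {}
    for line in lines:
        user, sep, rest = line.partition(":")
        if not sep or not user.strip().isdigit():
            continue  # skip the 'UserId: Friends' header and blank lines
        friends[int(user)] = [int(x) for x in rest.split()]
    return friends

sample = ["UserId: Friends", "1: 4 6 12 2 208", "2: 5 3 17 90 7"]
egonet = parse_egonet(sample)
print(egonet[1])
```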

Download links

See SNAP facebook, twitter, and Google Plus data, as well as the Kaggle competition based on the same data.

Citation

Please cite the following if you use the data:

Learning to Discover Social Circles in Ego Networks
Julian McAuley, Jure Leskovec
Neural Information Processing Systems (NIPS), 2012
pdf



Reddit Submissions

Description

Submissions of reddit posts (and in particular resubmissions of the same content) along with metadata.

Basic statistics

Num of submissions (images): 132,308
Num of unique images: 16,736
Timespan: July 2008 - January 2013

Metadata

Example

#image_id,unixtime,rawtime,title,total_votes,reddit_id,... number_of_downvotes,localtime,score,number_of_comments,username
1005,1335861624,2012-05-01T15:40:24.968266-07:00,I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,ninjaroflmaster
1005,1336470481,2012-05-08T16:48:01.418140-07:00,"Pushing your friend into the water,Level: 99",18,tds4i,16,funny,2,1336495681,14,0,hme4
1005,1339566752,2012-06-13T12:52:32.371941-07:00,I told him. He Didn't Listen,6,v0cma,4,funny,2,1339591952,2,0,HeyPatWhatsUp
1005,1342200476,2012-07-14T00:27:56.857805-07:00,Don't end up as this guy.,16,wjivx,7,funny,9,1342225676,-2,2,catalyst24
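Because titles can contain commas (and are quoted when they do), splitting rows on "," would misalign the columns. The sketch below uses Python's csv module instead. It assumes one submission per line and that lines beginning with "#" are headers. Since the header above is partly elided, the example accesses fields by position rather than by name; the function name is hypothetical.

```python
import csv

def read_submissions(text):
    """Parse comma-separated submission rows, skipping '#' header lines."""
    data_lines = [
        line for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    # csv.reader handles quoted fields that contain commas.
    return list(csv.reader(data_lines))

sample = "\n".join([
    "#image_id,unixtime,rawtime,title,total_votes,reddit_id,...",
    "1005,1335861624,2012-05-01T15:40:24.968266-07:00,"
    "I immediately regret this decision,27,t296r,20,pics,7,1335886824,13,0,"
    "ninjaroflmaster",
    "1005,1336470481,2012-05-08T16:48:01.418140-07:00,"
    "\"Pushing your friend into the water,Level: 99\",18,tds4i,16,funny,2,"
    "1336495681,14,0,hme4",
])
rows = read_submissions(sample)
for row in rows:
    # Position 3 is the title; the last position is the username.
    print(row[3], "|", row[-1])
```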

Download links

resubmissions data (7.3mb)

raw html of resubmissions (1.8gb)

See also the SNAP project page.

Citation

Please cite the following if you use the data:

Understanding the interplay between titles, content, and communities in social media
Himabindu Lakkaraju, Julian McAuley, Jure Leskovec
ICWSM, 2013
pdf



Questions and comments to Julian McAuley