==================================================
Q: I want to run your productGraph code on a different category than Baby clothes. I'm not quite sure what files I will need in the data folder. For example, if I wanted to run the code on Cell Phones and Accessories, would I need a different brands.txt.gz? Also, do I have to get the Cell Phones review data into a format similar to reviews_Baby.votes.gz? I also have another question: once I have saved a model using the saveModel command, how would I load the model in order to make predictions?

A: It should basically be the same set of files -- reviews, metadata, and edge data. You can use a different brands.txt.gz if you have one; these just get added to the list of words that will be considered by the model. But most of the relevant brands should already be in that file.

I'd also suggest checking out the code from our SIGIR 2015 paper. It's a considerably simpler model for the same task, and while our solution made use of visual features it could be used with any features that might be derived from text, metadata, etc. It also avoids the (expensive) topic-model fitting step in favor of a simpler Mahalanobis distance.

As for saveModel, I mainly used that for evaluation, not to reload the model. To re-load the model, note that the code actually stores all of the weights as a contiguous vector (W in the class). So you'll want to write a simpler "saveModel" that just writes out that vector, so that all you'd have to do is load the vector from the file into a double*, and then "copy" it into W using the function "getG" (a sketch of this is given at the end of this block).

Q: I'm still confused about the format of reviews_Baby.votes.gz. Is it generated from reviews_Baby.json.gz?

A: Yep, I just wanted to generate a version that was faster to read (rather than using a json parser) and already had capitalization / punctuation removed.

==================================================
Q: In the product metadata there is a field called "related" which contains "also bought", "also viewed", etc. I just wanted to know what exactly "also viewed" means. If I take it in the literal sense then all bought products should first be viewed, so "also bought" should be a subset of "also viewed", which is not the case. So does "also viewed" mean products which could have been potentially bought? Or is it something else?

A: In practice the graphs have limited out-degree, e.g. there are only 20 or so "also viewed" or "also bought" links. So while an "also bought" product should (presumably) have been "also viewed" at least once, it may still not be among the topmost "also viewed" products. E.g., consider somebody who wants to buy a camera. They'll click around on a bunch of cameras, such that the "also viewed" links are predominantly cameras. But they'll usually only buy one camera, so the "also bought" links won't tend to contain other cameras.

==================================================
Q: I tried running your code but the readme file contains very few details on how to run it. Do you have any instructions to run your code other than the readme file? I want to run it and see the output once. I tried, but it is giving the error "Validation error didn't improve for 5 iterations. Stopping..."

A: That's not an error, it just means that the code has finished.

Q: Oh ok. So now I can just use printAndSave.cpp to save the results and accuracy?

A: The accuracy is already printed out along with the above command. But yes, you can use printAndSave to print out the model and predictions etc. if you want them.
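Following up on the re-loading advice in the first exchange above: a minimal sketch (not part of the released code) of the simpler "saveModel" described there. The 'nWeights' length parameter and the final copy back into W via "getG" are assumptions based on that answer; actual names in the code may differ.

    #include <cstdio>
    #include <vector>

    // Sketch only: write the contiguous weight vector W to disk as a
    // length header followed by the raw doubles.
    void saveWeights(const char* path, const double* W, int nWeights) {
      FILE* f = fopen(path, "wb");
      fwrite(&nWeights, sizeof(int), 1, f);    // header: number of weights
      fwrite(W, sizeof(double), nWeights, f);  // the flat weight vector
      fclose(f);
    }

    // Read the vector back; the caller would then "copy" these values
    // into the model's W, e.g. via the getG function mentioned above.
    std::vector<double> loadWeights(const char* path) {
      FILE* f = fopen(path, "rb");
      int n = 0;
      if (f && fread(&n, sizeof(int), 1, f) == 1) {
        std::vector<double> W(n);
        if (fread(W.data(), sizeof(double), n, f) == (size_t)n) {
          fclose(f);
          return W;
        }
      }
      if (f) fclose(f);
      return {};  // read failed
    }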
==================================================
Q: I thoroughly enjoyed reading your paper entitled "Inferring Networks of Substitutable and Complementary Products". It provides a unique perspective on the different categories of links between products, which can be harnessed to deliver useful information to consumers with differing objectives. In particular, I had a few questions regarding the presented concepts:
1. On page 3, it was stated that "The logistic vector \beta then determines which topic memberships should be similar (or dissimilar) in order for the products to be related". In what case would dissimilar topics imply related products (for substitutes)?
2. On page 5, it was described that "each topic is associated with a particular node in the category tree". How are the latent topics, which are integer valued, associated with these categories?
3. Amazon's "Users who viewed x also viewed y" is a mixture of both substitutes and complements. Was manual selection performed to filter for substitutes only?

A: 1) For substitutes this feature is probably not so important, but it can still account for asymmetries in the data, which exist even for substitutes. E.g. a popular, well-reviewed laptop might be a good substitute for an unpopular, poorly-reviewed one, but not vice versa. This will show up in our co-purchasing graph in the sense that popular products have high in-degree in the "frequently browsed with" graph, but every product has constrained out-degree (since Amazon surfaces only the top-ranked recommendations). So ultimately this feature can account for qualitative aspects that explain why one product is more popular than another, even for substitutes.
2) It's still just a topic vector, like with regular LDA, except that it's sparse in the sense that theta_ik is constrained to be 0 for most topics k. The sparsity structure is determined by the category tree, in the sense that theta_ik can be non-zero only if the topic k is associated with one of the nodes in the category of product i. Hopefully the figure in the paper explains this better; it's really just showing the sparsity structure.
3) No manual selection was performed, but note that we train our model to *differentiate* between substitutes and complements, in the sense that substitutes should be predicted as non-complements and vice versa. As you say, "users who viewed x also viewed y" contains some complements, just as "users who bought x also bought y" contains some substitutes. But by differentiating between the two, what we're really predicting is: substitutes for x are products likely to be viewed with x but less likely to be bought with x, and complements for x are products likely to be bought with x but less likely to be viewed with x. Our intuition here is that this definition will account for "noise" in the two graphs, whose two main distinguishing features are that one is predominantly complements and one is predominantly substitutes.

==================================================
Q: 1) Can one Amazon category node in "categoryTree" have multiple micro-categories? E.g. the root has 10 micro-categories, whereas "night out & cocktail dress" has only 1 micro-category associated with it?
2) Is the "keyWords" section the probability of the words in the topic (a.k.a. a micro-category)? E.g., does 2.03744 in a micro-category topic vector mean a 2.03% chance that the word appears in the micro-category?
3) There are some negative values in keyWords. How should I interpret them? E.g. in \root: {"warm_waterproof": -1.628679, "limited": 2.03744, "raining": 0.263257, "magnetic": -1.352259, "desirable": -0.633848, "yellow": 3.563716, "sleek": 3.200103, "four": 2.278092, "woods": -0.610003, ... }
Can I interpret them as substitute (positive numbers) and complementary (negative numbers)?
4) If we want to index the topic vectors so that, when new products come in, we can find which micro-categories the new products belong to: since the index is basically TF-IDF, whereas the topic vector is a conditional probability, which kind of indexing scheme would you recommend in order to get the related micro-categories?

A: 1) More common categories get more micro-categories (up to a maximum of 10).
2) Not quite, you first need to pass them through a softmax function (which is why some values are negative).
3) They're just feature weights. These values then need to go through the feature functions as described in the paper.
4) You have to compute the topic representation of the new product (i.e., document). This is the same as with LDA, using the topic vectors above.

Q: For the softmax function, is it 1 / (1 + e^-x)? Then should we multiply each word's probability for a particular topic to get the probability of that product in that topic? If that's the case, would this penalize long documents with lots of words?

A: Not quite; that won't add to 1. Softmax is just (1) exponentiate; (2) divide by the sum. It won't penalize long docs, since topic proportions should be invariant to document length. Yes, p(topic|word) will be high for both boots and jeans. But you're considering p(topic|word) for every word in the doc, so this should be dominated by other boots/jeans-specific words.
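As a concrete illustration of that softmax step, here is a plain sketch (not taken from the released code; representing a topic's keyWords weights as a map is just for illustration):

    #include <algorithm>
    #include <cmath>
    #include <map>
    #include <string>

    // Convert one topic's raw keyWords weights into a probability
    // distribution over words: exponentiate, then divide by the sum.
    // Subtracting the max weight first is a standard stability trick.
    std::map<std::string, double> softmax(const std::map<std::string, double>& weights) {
      double mx = -1e300;
      for (const auto& kv : weights) mx = std::max(mx, kv.second);
      double Z = 0;
      std::map<std::string, double> p;
      for (const auto& kv : weights) {
        p[kv.first] = std::exp(kv.second - mx);
        Z += p[kv.first];
      }
      for (auto& kv : p) kv.second /= Z;  // values now sum to 1
      return p;
    }

Applied to the \root example above, the negative weight for "warm_waterproof" simply becomes a small (but still positive) probability: negative weights mean rare words in that topic, not complementarity.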
==================================================
Q: I have a question regarding the metadata. I downloaded the data for the electronics product category and converted it into csv format. However, I did not find the time at which the sales rank was captured. As you mentioned the data was collected over an 18-year period, I want to know whether the sales rank is the daily, weekly, or monthly rank throughout the 18 years, because the time series is very important in my research. Hope to hear from you soon. Thank you in advance.

A: It's just whatever the salesrank was whenever the data was crawled (which all happened within around a 3-4 week period). I'd suggest just looking at the most recent review date from the review data, which will be about the same time that the salesrank was collected.

==================================================
Q: For the 'Cell Phones' dataset, I get 3,447,276 for the number of ratings and 319,678 for the number of items, which is different from the numbers in your paper: 5,929,668 and 223,680, respectively. Do you know why I am getting the wrong numbers?

A: As I recall, the numbers in the paper refer to the un-deduplicated dataset, rather than the deduplicated single-category file that you used. Here I wanted to include both "copies" of a product (even if they have the same reviews) in the event that they still have a different set of co-purchases/co-views. So you should use all edges from the metadata file if possible, but merge the product IDs in the event that they have the same reviews. *Or* just use the de-duplicated file as you are already doing. I think the above step will have negligible impact on the results.
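As a sketch of that ID-merging step (illustrative only: treating products as duplicates when their concatenated review texts are identical is an assumption here, not the exact procedure used for the paper), one could map every ASIN sharing the same reviews to a single canonical ASIN:

    #include <map>
    #include <string>
    #include <vector>

    // Map every ASIN that shares the same set of reviews to one canonical
    // ASIN, so that edges from duplicate "copies" of a product can be
    // counted toward a single node.
    std::map<std::string, std::string> canonicalIds(
        const std::map<std::string, std::vector<std::string>>& reviewsByAsin) {
      std::map<std::string, std::string> sigToAsin;  // review signature -> first ASIN seen
      std::map<std::string, std::string> canonical;  // ASIN -> canonical ASIN
      for (const auto& kv : reviewsByAsin) {
        std::string sig;
        for (const auto& r : kv.second) sig += r + "\n";  // crude signature: concatenated reviews
        if (!sigToAsin.count(sig)) sigToAsin[sig] = kv.first;  // first copy wins
        canonical[kv.first] = sigToAsin[sig];
      }
      return canonical;
    }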
==================================================
Q: We tried to get the product graph by running the code on your website, but we got the following error (make Baby.out): Validation error didn't improve for 5 iterations. Stopping... / Probability didn't improve for 2 iterations. Stoping...
{ "corpus": "data/reviews_Baby.votes.gz", "lambda": 0.500000, "nAdjectives": 13996, "nNouns": 67099, "nBrands": 6644, "nUsers": 97611, "nItems": 32845, "nRatings": 135639, "nEdges": {"also_viewed": 101709, "buy_after_viewing": 4, "also_bought": 145945, "bought_together": 6678}, "nWeights": 359921, "iterations": [...] }
Then we found the product graph for Baby clothing at http://jmcauley.ucsd.edu/amazon/models_KDD/topK_Baby.json.gz. However, we are not sure if this is the product graph (nodes being the ASINs and the edges of 4 types). Could you please clarify this?

A: That output looks fine to me. Note that the dictionary being printed there is precisely the output of the algorithm (what does "iterations" say?). topK_Baby.json is a subset of popular products that we used to put together our interface (so yes, it is the "product graph"). The model files are also in there, which can be used to generate new graphs of recommendations.

==================================================
Q: From the product graph, we believe we can find the substitute and complementary products as follows. Please correct us if we are wrong:
1. For an item A, an item B is in both "items_buy_after_viewing" and "items_bought_together"/"items_also_bought": A and B are complementary products.
2. For an item A, an item C is in both "items_buy_after_viewing" and "items_also_viewed": A and C are substitute products.
For a product A, let's say product B is also viewed, bought after viewing, and also bought. Each of these edges in the product graph has a different weight. Is there a way to normalize these weights? We understand that it is a little tricky, but can you suggest a means of normalizing these weights for substitute and complementary products? We observed that the scores/weights are negative. What does this mean? We might have missed these details in your paper.

A: -- Item A and B are complements if they're in bought_together *or* also_bought (not "and")
-- Item A and B are substitutable if they're in buy_after_viewing *or* also_viewed
Not quite sure what you mean by normalizing weights -- the model is based on logistic regression, so the output (for each edge type) is really a probability that an edge exists (if it's a real value, maybe you need to pass it through a sigmoid first? See the sketch after this exchange).

Q: Maybe normalizing is not an appropriate word; aggregating might be a better word to explain this situation. If an item is in both the "bought_together" and "also_bought" sections of the product graph, can the two edges be aggregated into a single edge using an aggregation criterion such as a sum (since the two items are complements)? Is there a better way to do this? We also observed that the scores for products in "items_buy_after_viewing" are consistently low compared to the other edge types. Is this because of a range that was fixed for "items_buy_after_viewing"? Or is it just based on the probability of an edge being present?

A: Hmm, there's no "edge aggregation" going on. In the specific interface that I built it's really just surfacing "also_viewed" and "also_bought" links as substitutes and complements (respectively). I believe the "buy_after_viewing" scores are just lower because such links are rare in the corpus (at least for some categories), so the logistic regressor assigns them lower probabilities on average. This could be addressed by balancing the training set, I suppose, but given that the interface just works by ranking scores, their absolute values probably don't matter so much anyway.
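Following up on the sigmoid remark above: a minimal sketch of turning a real-valued logistic-regression score into an edge probability (the standard logistic function; the function name here is illustrative):

    #include <cmath>

    // The model is logistic regression, so a real-valued edge score s
    // maps to a probability via the logistic sigmoid 1 / (1 + e^-s).
    double edgeProbability(double score) {
      return 1.0 / (1.0 + std::exp(-score));
    }

Since the sigmoid is monotonic, it leaves any ranking by raw score unchanged, consistent with the point above that absolute values matter little when the interface only ranks candidates.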
==================================================
Q: In your KDD 2015 paper you mentioned there was a product hierarchy from Amazon. The dataset you hosted only includes the list of categories each product belongs to (not necessarily paths or any hierarchical information). Do you have the explicit product category hierarchy that we can use?

A: For all Clothing and Electronics subcategories, there's detailed (sub-)category information included on the dataset page I linked you. For movies, books, and music, we had to make use of category data from our old dataset (http://snap.stanford.edu/data/web-Amazon.html); for some reason Amazon didn't surface detailed category information for those categories at the time of our new crawl, but did for the old one. One day I hope to combine these old categories into the new dataset page, but I haven't gotten around to it yet.

==================================================
Q: We are not sure how you are calculating accuracy for the link prediction task, though. Since you have a single model, and accuracy (ACC = (TP + TN) / (TP + FP + FN + TN)) expresses how well the system is doing at predicting both classes, you are probably doing something like this: ACC_subs = TP / (TP + FP), which is precision or positive predictive value, and ACC_comp = TN / (FN + TN), which is the negative predictive value. Is this correct? Or are you calculating it in a different way?

A: The numbers in the paper really are just the fraction of correct predictions for each edge type ((true positives + true negatives) / #examples). The two numbers reported are just for the two different edge types, each of which is a binary classification task. The funny/annoying thing is that an edge of type "complement" is a negative example for an edge of type "substitute" (and vice versa), so in addition to the randomly sampled non-edges it's hard to build a balanced dataset (hence why random is not 50/50).
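To make that accuracy computation concrete, here is a minimal sketch (illustrative, not from the evaluation code) of the per-edge-type measure described above:

    #include <cstddef>
    #include <vector>

    // Accuracy for one edge type: the fraction of examples (true edges
    // plus sampled non-edges) whose predicted label matches the true
    // label, i.e. (TP + TN) / #examples.
    double accuracy(const std::vector<bool>& predicted, const std::vector<bool>& truth) {
      int correct = 0;
      for (std::size_t i = 0; i < truth.size(); ++i)
        if (predicted[i] == truth[i]) ++correct;
      return double(correct) / double(truth.size());
    }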