####################################################################################################
KDD (accept)
####################################################################################################

Dear Julian McAuley,

After a review of the procedure followed in reaching the decision for this paper, we have overturned the original decision and are delighted to accept it. We misinterpreted a comment from a reviewer in the discussion. When reassessed, it is clear that this paper does meet the bar for inclusion in the conference. Please accept our apologies for the procedural error and our congratulations on having your paper accepted.

####################################################################################################
KDD (reject)
####################################################################################################

Masked Reviewer ID: Assigned_Reviewer_1
Review: Question
*** How would you rate the novelty of the problem solved in this paper?
A minor variation of some well studied problems
*** How would you rate the technical ideas and development in this paper?
Substantial improvement over state-of-the-art methods
*** How would you rate the empirical study conducted in this paper?
Acceptable, but there is room for improvement
*** Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Partially (e.g., some of the used datasets are proprietary)
*** How would you rate the quality of presentation?
A very well-written paper, a pleasure to read. No obvious flaws.
*** What is your overall recommendation?
Weak accept. I vote for acceptance, although would not be upset if it were rejected.
*** List up to 3 particular strengths of the paper. If none, just say "none".
- Interesting problem to study. Strong motivation.
- Solid model to tackle the problem
*** List up to 3 particular weaknesses of this paper. If none, just say "none".
- The evaluation section didn't show the effectiveness of the model in predicting asymmetric relationships
- Didn't show the benefit of using a content-based model to infer the relationships between products, compared to behavior-based models
- Section 3.5 (cold-start problem) could be expanded with details
*** Detailed comments for the authors; justification of your overall rating.
The paper studied the problem of identifying different relationships between products, including substitutes and complements. To tackle the problem, the authors proposed to leverage a topic model and the products' review information. Experiments showed the effectiveness of their model. The paper was well written in general. I like that the authors considered the asymmetric relationship between products and integrated it into the model. My major concerns include the following:
- I didn't see a clear demonstration of the asymmetric relationship in the evaluation part. It was not clear whether the ground truth contained that information or not. The authors could probably test their model in a recommendation setting and evaluate whether the recommendations were better.
- The authors used a product's content from its reviews in the model, while the ground truth (users who viewed x also viewed y, etc.) used behavioral information. It would be better to show the benefit of the content-based method compared to the behavior-based method, such as for cold-start products with no behavioral data.
- The authors talked about another type of cold-start problem, for products with no review information.
I'd like to see more details in that section.
*** List of typos, grammatical errors and/or concrete suggestions to improve presentation
No.

Masked Reviewer ID: Assigned_Reviewer_2
Review: Question
*** How would you rate the novelty of the problem solved in this paper?
A fundamentally novel problem
*** How would you rate the technical ideas and development in this paper?
Substantial improvement over state-of-the-art methods
*** How would you rate the empirical study conducted in this paper?
Acceptable, but there is room for improvement
*** Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Partially (e.g., some of the used datasets are proprietary)
*** How would you rate the quality of presentation?
Well-written but has a significant number of typos and/or grammatical errors. Would need significant editing before publication.
*** What is your overall recommendation?
Accept. I vote and argue for acceptance, clearly belongs in the conference.
*** List up to 3 particular strengths of the paper. If none, just say "none".
- The investigation of the substitutable and complementary relationships buried in the product set using review texts contains some new elements.
- Extensive experiments have been conducted on a large data set, and promising experimental results were obtained.
*** List up to 3 particular weaknesses of this paper. If none, just say "none".
- It is not clear how the sparse representation in Sec. 2.2.3 is achieved.
*** Detailed comments for the authors; justification of your overall rating.
This paper proposes a supervised graphical topic model to automatically identify substitutes and complements for a particular product, using text from online reviews. It is important to investigate such a task for a useful recommender system. They evaluate their approach on a large Amazon dataset, and the experimental results demonstrate that their approach can identify substitutes and complements more accurately than the baseline approaches. The task of investigating the substitutable and complementary relationships buried in the product set using review texts contains some new elements. The strategy of jointly training the topic and logistic regression models is new. It has the advantage that the trained topics are discovered with high likelihood with respect to the known substitute and complement labels. The proposed model is scalable to large data sets thanks to the sparse topic representation for each product. Extensive experiments have been conducted on a large Amazon data set spanning several different domains. The experimental results on link prediction and ranking show the superiority of their approach compared with the baseline approaches. One issue is that it would be better to provide the technical details for achieving the sparse topic representation in Sec. 2.2.3. It also seems that hierarchical LDA could be considered to make use of the topic hierarchy.
*** List of typos, grammatical errors and/or concrete suggestions to improve presentation
Providing the running time for each experiment should also be considered. Also, it would provide more insight if some sample review text segments could be shown to illustrate two substitutable or complementary products.

Masked Reviewer ID: Assigned_Reviewer_3
Review: Question
*** How would you rate the novelty of the problem solved in this paper?
A minor variation of some well studied problems
*** How would you rate the technical ideas and development in this paper?
The technical development has some flaws
*** How would you rate the empirical study conducted in this paper?
Not thorough, or even faulty
*** Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Partially (e.g., some of the used datasets are proprietary)
*** How would you rate the quality of presentation?
A very well-written paper, a pleasure to read. No obvious flaws.
*** What is your overall recommendation?
Reject. Clearly below the standards for the conference.
*** List up to 3 particular strengths of the paper. If none, just say "none".
S1. The problem addressed is important and interesting.
S2. The techniques developed seem reasonable and sophisticated.
S3. The paper is well organized and well written. Experiments show at least an order of magnitude improvement over the baseline of "collaborative filtering" -- which is a rather amazing scale of improvement.
*** List up to 3 particular weaknesses of this paper. If none, just say "none".
W1. The problem has not been well motivated and justified.
W2. The problem is not well formulated, in terms of its correctness and performance requirements.
W3. The proposed techniques are not novel and not principled.
W4. The experiments are rather weak and flawed for validating the claims.
*** Detailed comments for the authors; justification of your overall rating.
This paper studies the problem of constructing a product relationship graph automatically, without relying on browsing and co-purchasing data but instead using review text and ratings, for multiple relationships (substitute, complement). It solves the problem as a supervised learning problem: first, it learns topical features by topic modeling, i.e., a set of topics that may characterize each product; second, it learns a model for link prediction, using the learned topic features and other manifest features such as rating, brand, and price. Both learning objectives are combined in an objective function, and the model parameters are learned by optimizing over a graph obtained from browsing and co-purchasing data.
The problem addressed is important and interesting. The techniques developed seem reasonable and sophisticated. The paper is well organized and well written. Experiments show at least an order of magnitude improvement over the baseline of "collaborative filtering" -- which is a rather amazing scale of improvement. However, on a deeper look, the paper seems to address the problem quite superficially, and it has several fundamental flaws.
W1. The problem has not been well motivated and justified.
The main motivation for using reviews and product specifications is that "co-counting" from browsing and purchasing logs 1) can be noisy for infrequently purchased products and 2) cannot explain the recommendations it provides. These arguments are not convincingly developed.
First, from the nature of the signals used, I am not convinced that the paper's approach can be more effective for establishing product relationships (and the experiments failed entirely to validate this claim). Regardless of the actual techniques, user browsing/purchasing behavior is essentially "crowd-sourcing", exploiting implicit human interpretation and judgement of products. Such behavioral data is also much more abundant and dynamic than reviews/ratings -- how many browses become a purchase, and how many purchases result in a review?
Product relationships can be quite subtle and dynamic. E.g., a product becomes competitive because it is on sale; a product becomes obsolete because an upgraded model has just come out. While user browsing will collectively update, reviews/ratings are mostly static and not updated (e.g., an outdated, once-popular product still has the same reviews).
Second, for the issues raised by the paper, noise and explanation, it is not clear that these issues are so inherent/severe as to warrant giving up user behavior data altogether. For #1, what exactly is the noise? For #2, why would the co-counting approach be fundamentally unable to provide "explanation"? These arguments are not established analytically or even empirically.
Third, while the authors claim the "co-counting" approach is inappropriate, in the end it turns out that the proposed approach relies on that problematic approach to establish its training data. This reliance does not seem logical.
W2. The problem is not well formulated, in terms of its correctness and performance requirements.
While the paper advocates building a product network, what would be considered a good product graph? What are the requirements of a product graph? Given that the product market is fast-changing, how should a product graph change accordingly? What about maintenance of the graph? What metrics should we use to evaluate the various quality aspects of a graph? None of these are clear.
W3. The proposed techniques are not novel and not principled.
While the supervised learning approach looks intuitive, as it builds upon various existing works, it is not clear what new technical challenges it is addressing and what technical novelty and merit it contributes. On the one hand, the paper does not identify the particular challenges for realizing its objectives, and it is thus unclear what unique technical barrier it is tackling. On the other hand, the solutions proposed are mostly built upon various pieces of existing work, with the core joint training/regression model coming from earlier work [24], and thus the technical novelty is quite limited. Beyond assembling existing techniques, the paper does not clearly reason about why the overall framework is designed the way it is -- and thus the solutions feel ad hoc and do not convey a principled development.
W4. The experiments are rather weak and flawed for validating the claims.
1) While the paper motivates from the issues of using user browsing/purchase data, it uses such data for training. The logic is flawed.
2) The baselines for comparison are extremely weak. In particular, its strongest baseline, which uses collaborative filtering over "the set of users who reviewed a and the set of users who reviewed b", is impractical and flawed. Users review an item after they have purchased it, and few users write reviews. How can we expect many *active* users who not only purchased both a and b (they are substitutes for each other!) but also wrote reviews? It is not surprising that the results report an order of magnitude improvement in accuracy.
3) It does not compare to the "co-counting" approach as a baseline to validate its claim. While the paper motivates from the weakness of the co-counting approach yet uses it as training data, it fails to compare against co-counting as a baseline and validate its claims.
4) The evaluation of cold start is rather informal and insufficient. Given that review data is much sparser than browsing/purchase data, cold start is a serious issue and should be seriously studied.
5) The claims of the paper in the introduction, about resisting noise and providing explanations, are not validated at all in the experiments.
*** List of typos, grammatical errors and/or concrete suggestions to improve presentation
None.

Masked Reviewer ID: Assigned_Reviewer_4
Review: Question
*** How would you rate the novelty of the problem solved in this paper?
A minor variation of some well studied problems
*** How would you rate the technical ideas and development in this paper?
Substantial improvement over state-of-the-art methods
*** How would you rate the empirical study conducted in this paper?
Acceptable, but there is room for improvement
*** Repeatability: are the data sets used publicly available (and thus the experiments may be repeated by a third party)?
Yes
*** How would you rate the quality of presentation?
A very well-written paper, a pleasure to read. No obvious flaws.
*** What is your overall recommendation?
Weak accept. I vote for acceptance, although would not be upset if it were rejected.
*** List up to 3 particular strengths of the paper. If none, just say "none".
- Well-written
- Interesting dataset
- Clean/natural model combining topic modeling and link prediction
*** List up to 3 particular weaknesses of this paper. If none, just say "none".
- Novelty/contribution is not clearly stated
- Missing a clearly related work
- Evaluation methodology could have been better and clearer
*** Detailed comments for the authors; justification of your overall rating.
As a non-economist, I enjoyed the introduction to the substitutes-and-complements classification task. It is, of course, sensible, and a natural extension of the product recommendation problem. I also very much like the idea of learning 'relatedness' from data. Unfortunately, neither the task nor the methods appear especially novel. [1] claims to be the first "work trying to identify substitutes and complements on e-commerce websites", and should be identified and contrasted as related work.
I have some concerns about the experimental methodology and in particular the evaluation. Cross-validation would have been nice (to eliminate any strange performance effects of the single random split), as would describing exactly what a split means. For example, are you splitting the data by edges only, such that other edges of the same node are in the training data? Or are you splitting the data by nodes, so that some edges cross the train/test boundary? It also seems that cold-start performance is better than performance on reviews, but it would be helpful to see review performance on the subset of data used for cold start to be certain. The ground truth of the experiments depends on an unknown algorithm -- it comes from crawling the "User who viewed/bought" recommendations on Amazon, whose presentation is under Amazon's control with no guarantees about the mechanism used.
Why are you reporting (only) accuracy? Links are fairly sparse (e.g., links in Men's clothing appear to be about 1 out of 16K). Thus, predicting no link should produce almost perfect accuracy. Other measures would highlight true positives. For many domains, link prediction is extremely difficult (as it is in this domain, as revealed by the very low precision@k graphs in Figure 4). The Category-Tree baseline explanation is unclear, since it describes co-counts between categories, yet you are predicting edges between products (not categories).
What fraction of the Amazon catalog did you crawl? What crawling policy did you use (e.g., breadth-first, etc.)?
When was it crawled? Did you somehow crawl just the popular parts of the Amazon catalog?
I suggest cutting half of Table 4 to make room for the additional text.
[1] Jiaqian Zheng, Xiaoyuan Wu, Junyu Niu, and Alvaro Bolivar. 2009. Substitutes or complements: another step forward in recommendations. In Proceedings of the 10th ACM Conference on Electronic Commerce (EC '09). ACM, New York, NY, USA, 139-146. DOI=10.1145/1566374.1566394 http://doi.acm.org/10.1145/1566374.1566394
*** List of typos, grammatical errors and/or concrete suggestions to improve presentation
When I read "Further Related Work" on page 2, I went back to see if I had missed a discussion of related work, and only found 2 citations. The word "Further" seems inappropriate. Some language in 2.2.2 (right above Eq. 2) seems imprecise (Beta isn't predicting anything). The sentence just before Section 3.5 needs to be revised.

####################################################################################################
WWW (reject)
####################################################################################################

----------------------- REVIEW 1 ---------------------
PAPER: 1295
TITLE: Finding Substitutes and Complements from Online Reviews
AUTHORS: Julian McAuley, Rahul Pandey and Jure Leskovec

OVERALL RECOMMENDATION: 1 (Good paper: The paper should be accepted, but I will not champion it)
REVIEWER EXPERTISE: 2 (Some familiarity: Generally aware of the area)
Originality of work: 3 (Creative: Few people in our community would have put these ideas together)
Potential impact of results: 3 (Broad: Could help ongoing research in a broader research community)
Quality of execution: 3 (Reasonable: Generally solid work, but certain claims could be justified better)
Quality of presentation: 3 (Reasonable: Understandable to a large extent, but parts of the paper need more work)
Adequacy of citations: 4 (Comprehensive: Can't think of any important paper that is missed)

----------- PAPER SUMMARY -----------
The paper presents the design of a recommendation system for "complement" and "substitute" products in a catalog of millions of products, utilizing the product reviews. The ranker is a logistic regressor for predicting asymmetric relations, which uses latent topics as features, as well as additional features such as product price, category in a given hierarchy, manufacturer, etc. Evaluation over a large dataset shows significant improvement over reasonable baselines.

----------- REASONS TO ACCEPT -----------
1. A novel task
2. Showing the benefit of supervised LDA for the task
3. Novel joint training of LDA and logistic regression for predicting asymmetric graph edges indicating product-product relationships
4. Large-scale evaluation over product pairs mined from Amazon, with a promise to release the dataset
5. Significant improvement over several baselines
6. Analysis of the item cold-start sub-task

----------- REASONS TO REJECT -----------
1. The exact implementation of the model is not thoroughly presented, especially which additional features are used in the logistic regression (these seem to provide a very high baseline) and what textual preprocessing is applied to the text on which LDA and TF-IDF are employed (e.g., how n-grams are extracted).
2. Why was a baseline without textual features, which according to the authors provide rather large leverage for classification, not used instead of the weak random baseline?
3. I am not sure whether a state-of-the-art item/item recsys was used as a baseline.
4. No good analysis of the difference between the pre-trained LDA topics and the supervised topics, though experiments show the benefit of the latter.
5. Several off-topic discussions, such as item browsing and social-network recommendations without real evaluation, instead of additional analysis of the paper's main task.

----------- COMMENTS FOR AUTHORS -----------
The discussion of additional tasks could be shrunk in favor of more implementation details, to help other researchers with re-implementation.

----------------------- REVIEW 2 ---------------------
PAPER: 1295
TITLE: Finding Substitutes and Complements from Online Reviews
AUTHORS: Julian McAuley, Rahul Pandey and Jure Leskovec

OVERALL RECOMMENDATION: 0 (OK paper: I hope we can find better papers to accept)
REVIEWER EXPERTISE: 3 (Knowledgeable: Knowledgeable in this sub-area)
Originality of work: 2 (Conventional: Rather straightforward, a number of people could have come up with this)
Potential impact of results: 2 (Limited: Impact limited to improving the state-of-the-art for the problem being tackled)
Quality of execution: 3 (Reasonable: Generally solid work, but certain claims could be justified better)
Quality of presentation: 3 (Reasonable: Understandable to a large extent, but parts of the paper need more work)
Adequacy of citations: 3 (Reasonable: Coverage of past work is acceptable, but a few papers are missing)

----------- PAPER SUMMARY -----------
The authors present a system, Sceptre, that learns concepts of substitute and complement items from unstructured text, with the goal of predicting relationships between products. The techniques behind the model include LDA, logistic regression and link prediction. The prototype is available online, and it is possible to browse the different recommendations to get a better idea of the paper.

----------- REASONS TO ACCEPT -----------
- Working system.
- Uses a large data set.

----------- REASONS TO REJECT -----------
- No user evaluation.
- Poor data analysis given the promise of the paper and the size of the data set.
- Quality issues are not explored in detail.

----------- COMMENTS FOR AUTHORS -----------
The authors do a very good job describing the system and the underlying techniques. However, there is no user evaluation of the quality of the recommendations. The data set for the experiments contains 9M products from Amazon, a very rich data set for exploring the problem space. However, the data analysis is very shallow. One can expect that certain categories would have more or fewer substitutes or complements (what is a complement for a book?), but there is little, if anything, on this in the submission. Of the 6 categories presented in Table 2, only two (electronics and men's clothing) are selected for some analysis. Unfortunately, women's clothing is not presented, making it difficult to assess the quality of the topics in clothing (men vs. women). The topic analysis (Section 3.6) is oversimplified, with lots of open questions about quality and about potential improvements for tightening the topics within categories. A few observations:
- Synonyms: e92 (little, small, mini), c75 (expandable waist, elastic waist)
- Missing topics: e111 (Nikon, Olympus, and Sony are missing). Search for "cameras" in Amazon and they got it right.
- Diversity of brands vs. description: c52 is a good example of balance between brands and description. Maybe this should be the optimizing function for topic generation.
As stated before, there is no user evaluation of the quality of the recommendations, so I did play with the system, which works and has good performance. Initial observations after trying a few examples:
- The quality of substitutes is much better than that of complements. However, ranking is still a problem. The recommendation for "Dockers" returns cargo pants as a substitute (positions #1 and #2), which to this reviewer is not a good recommendation. By the way, c75 doesn't have "cargo pants", so I'm not quite sure why it was ranked so high.
- Explanations have lots of quality problems. Using Timberland Men's Chocorua as an example:
1) Highlighting issues (marking what the system returns): "Well made, nice looking, good soles, waterproof, good arch and ankle support" -> "waterproof" and "arch and ankle support" look more relevant to highlight.
2) Bad quality (garbled text, poor content): "Nicely packaged etc etc etc etc etc etc etc . hello fine fine the proces and buy il like a model for empled fine fine fine fine fine it likd"
3) Bad recommendation (people are not recommending the product, yet the system shows it): "But ten years went on more like 15 years, and the boots got super uncomfortable. The shoe s did not fit any way like my old one s do i was un happy with them"

----------------------- REVIEW 3 ---------------------
PAPER: 1295
TITLE: Finding Substitutes and Complements from Online Reviews
AUTHORS: Julian McAuley, Rahul Pandey and Jure Leskovec

OVERALL RECOMMENDATION: 1 (Good paper: The paper should be accepted, but I will not champion it)
REVIEWER EXPERTISE: 3 (Knowledgeable: Knowledgeable in this sub-area)
Originality of work: 3 (Creative: Few people in our community would have put these ideas together)
Potential impact of results: 2 (Limited: Impact limited to improving the state-of-the-art for the problem being tackled)
Quality of execution: 3 (Reasonable: Generally solid work, but certain claims could be justified better)
Quality of presentation: 4 (Lucid: Very well written in every aspect, a pleasure to read, easy to follow)
Adequacy of citations: 4 (Comprehensive: Can't think of any important paper that is missed)

----------- PAPER SUMMARY -----------
In this paper, the authors use the review text for various products to determine whether a pair of products are substitutes or complements. Substitutes are products that can be purchased instead of each other, whereas complements are products, like accessories, which can be purchased in addition. The approach mentioned in the paper uses a combination of LDA topics and logistic regression. In particular, their model trains LDA topics such that the topic vectors can be used for the complement/substitute prediction tasks. The authors formulate an objective function that maximizes the likelihood of the data based on the topics and is able to distinguish an edge from a non-edge. They use an off-the-shelf gradient-based solver to optimize the function. The authors present an extensive experimental analysis over a real-world dataset culled from Amazon.

----------- REASONS TO ACCEPT -----------
1. Well written; I especially liked how the authors provide intuitions for their modeling in Section 2.2, adding additional features one-by-one.
2. Using the product taxonomies to guide the topic sampling is a neat idea. It provides a nice way to exploit the known information to generate meaningful, interpretable topics.

----------- REASONS TO REJECT -----------
1. (Section 4) Using the logistic model, which was originally trained to predict a link between two products, to compare across edges seems erroneous, since we didn't actually train the model to rank across edges.
2. The main thesis of the paper was to distinguish a complementary item from a substitute item. However, I do not see any empirical analysis corresponding to this task. Section 3.4 talks about link prediction; however, it does not mention whether this is on the complement graph or the substitute graph. I would be more interested to know the accuracy on each of these graphs separately -- essentially, whether we are better at one task than the other.
3. While the algorithms proposed are quite interesting and novel, the problem of detecting substitutes and complements does not usually occur in practice. Most merchants/manufacturers selling the products usually reveal this information as part of the product description -- so I'm not sure I see the need for the prediction here.
4. The experimental section is well described, but it does not prove the claims made by the paper. For instance, we don't know how well the model distinguishes a complementary item from a substitute.

----------- COMMENTS FOR AUTHORS -----------
You mention that the dataset considered was very large -- on the order of hundreds of millions of edges in the graph. How are you scaling the computation to such large graphs? Are you assuming that your entire matrix fits in memory?
Section 3.2.3: Why do we need to select a balanced dataset? Can we improve the model accuracy by training over more negative pairs? Would it also make sense to sample a different negative example in each iteration?
The results in Table 3 are quite striking, actually. Although the topic features were computed by the vw model, they don't perform as well as the Sceptre method. I guess joint training helps after all, especially in this case.
Section 3.4.1: How are you generating the recommendations via Sceptre? Until now, we have only been able to identify substitutes and complements. Please explain how this translates to a top-k recommendation -- is this done using the predicted logistic score?
Section 3.7 on friend recommendations seems out of place relative to the rest of the paper -- since most of your contributions have been in distinguishing a complement from a substitute (which is clearly much harder than traditional link/friend prediction).
Section 3.1: Did you really mean 1 TB of RAM?
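[The questions above -- how the balanced training set is built and whether the logistic score is reused for top-k ranking -- can be made concrete with a minimal sketch. The following Python snippet is NOT the authors' Sceptre implementation: Sceptre trains the topics jointly with the link objective, whereas this sketch fits LDA separately using off-the-shelf scikit-learn components, and all product counts, feature choices and numbers are hypothetical. It only illustrates the pipeline the reviews describe: topic plus manifest features for a directed product pair, one sampled non-edge per observed edge, and candidates ranked by predicted link probability.]

# Illustrative sketch only, under the assumptions stated above.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: bag-of-words counts per product, a few manifest
# features (e.g. price, rating, brand code), and observed directed edges.
n_products, vocab, n_topics = 200, 500, 10
counts = rng.poisson(1.0, size=(n_products, vocab))
manifest = rng.normal(size=(n_products, 3))
edges = [(rng.integers(n_products), rng.integers(n_products)) for _ in range(300)]

# Step 1: topic features per product (fit separately here; the paper's
# contribution is training topics jointly with the link objective).
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
theta = lda.fit_transform(counts)

def pair_features(a, b):
    # Directed pair representation: both products' topics plus manifest
    # features, so the classifier can learn asymmetric relations.
    return np.concatenate([theta[a], theta[b], manifest[a], manifest[b]])

# Step 2: balanced training set -- one sampled non-edge per observed edge
# (the reviewer asks whether more, or resampled, negatives would help).
pos = [pair_features(a, b) for a, b in edges]
neg = [pair_features(rng.integers(n_products), rng.integers(n_products))
       for _ in edges]
X = np.vstack(pos + neg)
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Step 3: rank candidate partners for a query product by predicted link
# probability -- the score the reviewer suspects is reused for top-k ranking.
query = 0
cands = [j for j in range(n_products) if j != query]
scores = clf.predict_proba(np.vstack([pair_features(query, j) for j in cands]))[:, 1]
top_k = [cands[i] for i in np.argsort(-scores)[:10]]
print("top-10 candidates for product 0:", top_k)

[Ranking by the classifier's probability, as in Step 3, is exactly the practice the reviewer questions: the model is trained to separate edges from non-edges, not to order edges against one another.]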
----------------------- REVIEW 4 ---------------------
PAPER: 1295
TITLE: Finding Substitutes and Complements from Online Reviews
AUTHORS: Julian McAuley, Rahul Pandey and Jure Leskovec

OVERALL RECOMMENDATION: 0 (OK paper: I hope we can find better papers to accept)
REVIEWER EXPERTISE: 3 (Knowledgeable: Knowledgeable in this sub-area)
Originality of work: 3 (Creative: Few people in our community would have put these ideas together)
Potential impact of results: 2 (Limited: Impact limited to improving the state-of-the-art for the problem being tackled)
Quality of execution: 3 (Reasonable: Generally solid work, but certain claims could be justified better)
Quality of presentation: 3 (Reasonable: Understandable to a large extent, but parts of the paper need more work)
Adequacy of citations: 4 (Comprehensive: Can't think of any important paper that is missed)

----------- PAPER SUMMARY -----------
METAREVIEW

----------- REASONS TO ACCEPT -----------
- Novel problem
- Joint modeling of LDA and logistic regression (for asymmetric link prediction)

----------- REASONS TO REJECT -----------
- The proposed problem has limited impact
- The technical contribution is limited -- essentially a link prediction formulation
- The quality of the evaluation needs further improvement -- some claims are not well addressed by the experimental analysis

----------- COMMENTS FOR AUTHORS -----------
METAREVIEW

------------------------- METAREVIEW ------------------------
There is no metareview for this paper