ICDM 2016 (accept)
--======== Review Reports ========--

The review report from reviewer #1:

*1: Is the paper relevant to ICDM?
[_] No [X] Yes

*2: How innovative is the paper?
[_] 6 (Very innovative) [X] 3 (Innovative) [_] -2 (Marginally) [_] -4 (Not very much) [_] -6 (Not at all)

*3: How would you rate the technical quality of the paper?
[_] 6 (Very high) [X] 3 (High) [_] -2 (Marginal) [_] -4 (Low) [_] -6 (Very low)

*4: How is the presentation?
[_] 6 (Excellent) [X] 3 (Good) [_] -2 (Marginal) [_] -4 (Below average) [_] -6 (Poor)

*5: Is the paper of interest to ICDM users and practitioners?
[X] 3 (Yes) [_] 2 (May be) [_] 1 (No) [_] 0 (Not applicable)

*6: What is your confidence in your review of this paper?
[_] 2 (High) [X] 1 (Medium) [_] 0 (Low)

*7: Overall recommendation
[_] 6: must accept (in top 25% of ICDM accepted papers) [X] 3: should accept (in top 80% of ICDM accepted papers) [_] -2: marginal (in bottom 20% of ICDM accepted papers) [_] -4: should reject (below acceptance bar) [_] -6: must reject (unacceptable: too weak, incomplete, or wrong)

*8: Summary of the paper's main contribution and impact
This paper proposes a method to model heterogeneous relationships for item-to-item recommendation tasks. The method uses 'mixtures' of non-metric embeddings, which relaxes the identity and symmetry assumptions of existing metric-based methods. This approach effectively generates diverse and cross-category recommendations that capture more complex relationships than mere visual similarity. Experimental results using co-purchase and co-browsing data from Amazon show that the proposed method is accurate at link prediction tasks and can make effective recommendations of heterogeneous content.

*9: Justification of your recommendation
The paper proposes a rather novel idea: cross-category item-to-item recommendation using heterogeneous data. It is technically sound and well presented.
Experimental results on Amazon data show effectiveness on link prediction over 3 existing methods for cross-category link prediction.

*10: Three strong points of this paper (please number each point)
1. Novel idea of cross-category item-to-item recommendation
2. Technically sound, with good presentation
3. Results show effectiveness on link prediction

*11: Three weak points of this paper (please number each point)
1. Lacks some details of the experimental setup, in particular the random generation of the irrelevant set.
2. It is not very clear why the solution is claimed to be "highly scalable".

*12: Is this submission among the best 10% of submissions that you reviewed for ICDM'16?
[X] No [_] Yes

*13: Would you be able to replicate the results based on the information given in the paper?
[X] No [_] Yes

*14: Are the data and implementations publicly available for possible replication?
[X] No [_] Yes

*15: If the paper is accepted, which format would you suggest?
[X] Regular Paper [_] Short Paper

*16: Detailed comments for the authors
This paper proposes a method to model heterogeneous relationships for item-to-item recommendation tasks. The method uses 'mixtures' of non-metric embeddings, which relaxes the identity and symmetry assumptions of existing metric-based methods. This approach effectively generates diverse and cross-category recommendations that capture more complex relationships than mere visual similarity. Experimental results using co-purchase and co-browsing data from Amazon show that the proposed method is accurate at link prediction tasks and can make effective recommendations of heterogeneous content.

A few comments: It is clear that the approach can handle large-scale data, but it is not very clear why it is a "highly scalable" approach. Is it easy to compute in parallel, or on multiple machines? It would be good to understand the effect of the irrelevant set on link prediction or recommendation.
Since the irrelevant set is randomly generated, it could have some impact on the link prediction errors. In Figure 2, the visualization of "women's clothing", it is interesting to see some pictures of men's clothing. Is this a data error or something else? The paper makes very interesting heterogeneous recommendations of content. However, the quality of the recommendations is hard to measure. It would be interesting (if possible) to do some A/B testing on the effectiveness of the recommendations.

========================================================

The review report from reviewer #2:

*1: Is the paper relevant to ICDM?
[_] No [X] Yes

*2: How innovative is the paper?
[_] 6 (Very innovative) [_] 3 (Innovative) [_] -2 (Marginally) [X] -4 (Not very much) [_] -6 (Not at all)

*3: How would you rate the technical quality of the paper?
[_] 6 (Very high) [_] 3 (High) [X] -2 (Marginal) [_] -4 (Low) [_] -6 (Very low)

*4: How is the presentation?
[_] 6 (Excellent) [_] 3 (Good) [X] -2 (Marginal) [_] -4 (Below average) [_] -6 (Poor)

*5: Is the paper of interest to ICDM users and practitioners?
[X] 3 (Yes) [_] 2 (May be) [_] 1 (No) [_] 0 (Not applicable)

*6: What is your confidence in your review of this paper?
[X] 2 (High) [_] 1 (Medium) [_] 0 (Low)

*7: Overall recommendation
[_] 6: must accept (in top 25% of ICDM accepted papers) [_] 3: should accept (in top 80% of ICDM accepted papers) [X] -2: marginal (in bottom 20% of ICDM accepted papers) [_] -4: should reject (below acceptance bar) [_] -6: must reject (unacceptable: too weak, incomplete, or wrong)

*8: Summary of the paper's main contribution and impact
The authors developed a metric-learning method designed for item-to-item recommender algorithms. Metrics are optimized for each category, and the learned metrics are then integrated by linear combination.
*9: Justification of your recommendation
Using hierarchical models is a straightforward approach to dealing with heterogeneous data, but the experiments are performed thoroughly and carefully.

*10: Three strong points of this paper (please number each point)
- extensive experiments

*11: Three weak points of this paper (please number each point)
- straightforward model

*12: Is this submission among the best 10% of submissions that you reviewed for ICDM'16?
[X] No [_] Yes

*13: Would you be able to replicate the results based on the information given in the paper?
[_] No [X] Yes

*14: Are the data and implementations publicly available for possible replication?
[X] No [_] Yes

*15: If the paper is accepted, which format would you suggest?
[_] Regular Paper [X] Short Paper

*16: Detailed comments for the authors
I consider the proposed task a kind of cross-domain recommendation, and it would be better to discuss your approach in this context, clarifying the pros and cons of your approach compared to other cross-domain recommendation approaches. A good tutorial on cross-domain recommendation can be found in Cantador & Cremonesi, "Cross-Domain Recommender Systems", RecSys 2014 tutorial: https://recsys.acm.org/recsys14/tutorials/#content-tab-1-2-tab

Equation numbers in the text should be written with parentheses. In Eq. (5), an item y is ignored. Is there any justification?

========================================================

The review report from reviewer #3:

*1: Is the paper relevant to ICDM?
[_] No [X] Yes

*2: How innovative is the paper?
[_] 6 (Very innovative) [X] 3 (Innovative) [_] -2 (Marginally) [_] -4 (Not very much) [_] -6 (Not at all)

*3: How would you rate the technical quality of the paper?
[_] 6 (Very high) [X] 3 (High) [_] -2 (Marginal) [_] -4 (Low) [_] -6 (Very low)

*4: How is the presentation?
[_] 6 (Excellent) [X] 3 (Good) [_] -2 (Marginal) [_] -4 (Below average) [_] -6 (Poor)

*5: Is the paper of interest to ICDM users and practitioners?
[X] 3 (Yes) [_] 2 (May be) [_] 1 (No) [_] 0 (Not applicable)

*6: What is your confidence in your review of this paper?
[X] 2 (High) [_] 1 (Medium) [_] 0 (Low)

*7: Overall recommendation
[_] 6: must accept (in top 25% of ICDM accepted papers) [X] 3: should accept (in top 80% of ICDM accepted papers) [_] -2: marginal (in bottom 20% of ICDM accepted papers) [_] -4: should reject (below acceptance bar) [_] -6: must reject (unacceptable: too weak, incomplete, or wrong)

*8: Summary of the paper's main contribution and impact
The paper presents a new recommendation algorithm that recommends heterogeneous items that are compatible. The problem is not trivial but very important. The proposed approach is novel.

*9: Justification of your recommendation
The paper tackles a difficult problem in recommender systems. Overall, the proposed approach is new, and the experiments are comprehensive and convincing. In the proposed approach, features are first extracted from images and then projected into different low-dimensional spaces. A final recommendation is made by combining distances from multiple embedding spaces. The use of multiple embedding spaces models the existence of various reasons that heterogeneous items are matched together. The paper is well organized and presented.

*10: Three strong points of this paper (please number each point)
1. new recommendation algorithm;
2. difficult problem;
3. convincing results.

*11: Three weak points of this paper (please number each point)
1. need more detailed examples of the embedding spaces;

*12: Is this submission among the best 10% of submissions that you reviewed for ICDM'16?
[_] No [X] Yes

*13: Would you be able to replicate the results based on the information given in the paper?
[X] No [_] Yes

*14: Are the data and implementations publicly available for possible replication?
[_] No [X] Yes

*15: If the paper is accepted, which format would you suggest?
[X] Regular Paper [_] Short Paper

*16: Detailed comments for the authors
I feel this is a nice paper to read through. One limitation is that the features are extracted from product images, which is a very special case (if not the easiest case). It would be interesting to see how the proposed approach works when the features are not from images (e.g., text, or, even more difficult, no explicit features). It would also be helpful to present the features of the matched items in addition to the embedding spaces, which may validate the presumption. It is also interesting to see whether the feature learning can be coupled with recommendation, that is, whether the learned features would correspond to the reason why items are matched.

========================================================
RecSys 2016 (reject)
========================================================

----------------------- REVIEW 1 ---------------------
PAPER: 155
TITLE: Monomer: Non-Metric Mixtures-of-Embeddings for Learning Visual Compatibility Across Categories
AUTHORS: Ruining He, Charles Packer and Julian McAuley

OVERALL EVALUATION: 0 (borderline paper)
REVIEWER'S CONFIDENCE: 3 (medium)
Relevance for RecSys: 5 (excellent)
Novelty: 4 (good)
Technical quality: 3 (fair)
Significance: 4 (good)
Presentation and readability: 3 (fair)

----------- Review -----------
Comments to the Author
The authors present a method, mixtures of non-metric embeddings for recommendation, to model heterogeneous relationships for item-to-item recommendation tasks, which is shown to perform better than existing methods. From my point of view, the main strong points of this work are: (1) good paper organization; (2) sufficient experiments; and (3) impressive results. The major limitation of this work is: (1) the theoretical explanation is not very clear. If this limitation were addressed, readers could better understand how the algorithm behaves in real tasks.
Minor suggestions:
(1) If Figure 4 is only used to show the clustering problem, sampling fewer items from the dataset could be better; a visualization of 10,000 items is not very clear.
(2) Figure 3 shows that LMT tends to recommend items that are very similar to the query. Can the authors give a quantitative measurement for this recommendation, for example, the proportion of similar goods and the proportion of related goods among all recommended goods? If LMT gives 0 percent related goods, then the result of Monomer will be more significant.

Question:
(1) In section 4.4, I would like to see the experimental process: how were visual spaces 2 to 5 characterized by patterns, colours, and styles?

----------------------- REVIEW 2 ---------------------
PAPER: 155
TITLE: Monomer: Non-Metric Mixtures-of-Embeddings for Learning Visual Compatibility Across Categories
AUTHORS: Ruining He, Charles Packer and Julian McAuley

OVERALL EVALUATION: -1 (weak reject)
REVIEWER'S CONFIDENCE: 4 (high)
Relevance for RecSys: 5 (excellent)
Novelty: 4 (good)
Technical quality: 4 (good)
Significance: 3 (fair)
Presentation and readability: 4 (good)

----------- Review -----------
A mixtures-of-non-metric-embeddings method is presented. In particular, it is useful for modeling heterogeneous relations. The experimental study shows that in some cases it outperforms existing methods. Given the good results, this paper can be interesting, but I am missing the theoretical support. Moreover, I find the algorithm a bit hard to follow.
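[Editor's note] Several reviewers above describe the same mechanism: a query item is embedded into one "anchor" space, candidate items are embedded into several "support" spaces, and the per-space distances are mixed and squashed into a relatedness probability. The sketch below illustrates that idea only; all names, shapes, the uniform mixture weights, and the offset `c` are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of a mixtures-of-non-metric-embeddings scorer.
# Because the query and the candidate use different projections, the
# resulting "distance" is non-metric: d(x, x) is generally nonzero and
# d(x, y) != d(y, x), which is the relaxation the reviews discuss.
import numpy as np

rng = np.random.default_rng(0)

F, K, N = 8, 4, 3                      # raw feature dim, embedding dim, number of spaces
W_anchor = rng.normal(size=(K, F))     # projection for the query ("anchor") space
W_support = rng.normal(size=(N, K, F)) # one projection per candidate ("support") space
weights = np.full(N, 1.0 / N)          # mixture weights (assumed uniform here)
c = 5.0                                # assumed offset inside the sigmoid

def relatedness(x, y):
    """P(x and y are related): sigmoid of (offset - mixed distance)."""
    qx = W_anchor @ x                  # embed the query once
    d = np.array([np.linalg.norm(W_support[n] @ y - qx) for n in range(N)])
    mixed = float(weights @ d)         # weighted sum of per-space distances
    return 1.0 / (1.0 + np.exp(mixed - c))  # shorter distance -> higher probability

x, y = rng.normal(size=F), rng.normal(size=F)
p = relatedness(x, y)
```

Note that `relatedness(x, y)` and `relatedness(y, x)` differ in general, which is exactly the asymmetry the metric-based baselines cannot express.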
----------------------- REVIEW 3 ---------------------
PAPER: 155
TITLE: Monomer: Non-Metric Mixtures-of-Embeddings for Learning Visual Compatibility Across Categories
AUTHORS: Ruining He, Charles Packer and Julian McAuley

OVERALL EVALUATION: 0 (borderline paper)
REVIEWER'S CONFIDENCE: 3 (medium)
Relevance for RecSys: 4 (good)
Novelty: 3 (fair)
Technical quality: 3 (fair)
Significance: 2 (poor)
Presentation and readability: 4 (good)

----------- Review -----------
The authors addressed the problem of mining relationships between items. Online recommender systems require a powerful component for identifying such relationships. The authors proposed a method called "Mixtures of Non-Metric Embeddings for Recommendation" in which the query item is embedded into one space, the comparable items are embedded into multiple spaces, and a short distance in any space means "relatedness". The experimental design and numerical results look good. However, this paper has the following fatal weak points.

1) Weak novelty towards a recommendation application: Certainly users require rich relationships between items. However, how are the concrete types of relationships defined? Previously we believed a click on "looking for related items" would return similar items, like the baseline pictures in Figure 3. The proposed method returns multiple types of "related" items, but it is difficult to know whether all of them are expected - why, given a shoe, do we want a watch and not socks?

2) Weak assumption which breaks the trade-off between precision and recall: Using multiple visual spaces to embed items and evaluating the distance in every space will give us higher recall, but precision will decrease seriously. Do the spaces have equal importance? Can the distances be evaluated on the same scale? How can an item's relationships from different relation types (visual spaces) be reasonably evaluated?
3) Too strong a claim in the limitations: The authors said previous approaches relied on "nearest-neighbor" assumptions. However, their approach also relies on the "NN" assumption: the previous methods rely on NN in a single space, while this method relies on NN in multiple spaces. It is not correct to motivate the model by claiming to address this "limitation".

There are also several small issues:
1) Abstract: uncover complicated and heterogeneous of relationships -> remove "of"
2) The contributions are not clear. The authors should summarize their contributions in bullets, as well as the challenges.
3) Table 4 shows the test errors. Numerical values (with standard deviations) should be presented; it is not professional to show only percentages.

========================================================
SIGIR 2016 (reject)
========================================================

------------- Review from Reviewer 1 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of Work (1-5, accept threshold=3) : 3
Technical Soundness (1-5, accept threshold=3) : 4
Quality of Presentation (1-5, accept threshold=3) : 4
Impact of Ideas or Results (1-5, accept threshold=3) : 4
Adequacy of Citations (1-5, accept threshold=3) : 4
Reproducibility of Methods (1-5, accept threshold=3) : 3
Overall Recommendation (1-6) : 3

-- Comments to the author(s):
This is the meta-review by the Primary PCM responsible for your paper, and takes into account the opinions expressed by the referees, the subsequent decision thread, and my own opinions about your work. There was considerable discussion about this paper. Firstly, I appreciated the way the paper was written; in particular, I felt the model was well presented and the intuitions behind various formal aspects of the model were well conveyed. There was a claim about lack of novelty with respect to [20]. Indeed, the paper is similar to [20] with respect to using the Mahalanobis distance.
However, to me, the innovation is in section 3.3, where probabilistic mixtures are used; this is where I deem the novelty of the paper to lie. I also feel that the novelty, though not major, was sufficient for SIGIR. It should be mentioned that there was another issue that clouded the novelty of this paper. The PC chairs informed me that one of the authors of the paper is also an author of a similar paper to be published at WWW 2016. In my judgement I did not take this into account because that paper is not yet published. However, a transparent approach that the authors could have adopted would be to mention work that is about to be published as well.

However, there are significant issues regarding the evaluation. Only one dataset is used, which does not demonstrate the generality of the solution. But more importantly, it seems that only one split of the dataset into training/validation/test was used. With this, it could be the case that the authors "got lucky" in the single sample they use, and we do not know how the learning procedure would behave on different test sets, i.e., the variability of the results with different training and different unseen test sets. Standard k-fold cross-validation should have been employed. This also prevented the use of statistical significance tests to check whether their results are really superior to the baselines. In the face of close results, such as those in Table 3 for the Men and Women categories, this is essential. It also seems the authors built an artificial, unrealistically balanced dataset, and we have no idea how this may favour their method and the baselines. Finally, I agree with one of the reviewers regarding parameterization. In the way the text is written, it seems that the authors chose the best parameters that favoured their method *in the test set*, which is not appropriate. It is not clear whether the same was done for all baselines. All parameters should have been tuned on the validation set for all methods to be fair.
There were some doubts expressed about efficiency. I did not weight this aspect very highly in my judgement, but the authors should analyse this aspect to show the model can be feasibly employed in a real setting.

-- Summary:
In summary, this is a well-presented model with sufficient novelty; however, the evaluation of this model has significant issues which prevent its acceptance at SIGIR.

---------- End of Review from Reviewer 1 ----------

------------- Review from Reviewer 2 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of Work (1-5, accept threshold=3) : 3
Technical Soundness (1-5, accept threshold=3) : 3
Quality of Presentation (1-5, accept threshold=3) : 3
Impact of Ideas or Results (1-5, accept threshold=3) : 3
Adequacy of Citations (1-5, accept threshold=3) : 3
Reproducibility of Methods (1-5, accept threshold=3) : 3
Overall Recommendation (1-6) : 3

-- Comments to the author(s):
As Secondary PCM I have reviewed this submission, the reviews as well as the discussion, and I concur with the decision.

-- Summary:
Same as above.

---------- End of Review from Reviewer 2 ----------

------------- Review from Reviewer 3 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of Work (1-5, accept threshold=3) : 3
Technical Soundness (1-5, accept threshold=3) : 4
Quality of Presentation (1-5, accept threshold=3) : 4
Impact of Ideas or Results (1-5, accept threshold=3) : 4
Adequacy of Citations (1-5, accept threshold=3) : 4
Reproducibility of Methods (1-5, accept threshold=3) : 3
Overall Recommendation (1-6) : 3

-- Comments to the author(s):
As Secondary PCM I have reviewed this submission, the reviews as well as the discussion, and I concur with the decision.

-- Summary:
As Secondary PCM I have reviewed this submission, the reviews as well as the discussion, and I concur with the decision.
---------- End of Review from Reviewer 3 ----------

------------- Review from Reviewer 4 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of Work (1-5, accept threshold=3) : 3
Technical Soundness (1-5, accept threshold=3) : 4
Quality of Presentation (1-5, accept threshold=3) : 5
Impact of Ideas or Results (1-5, accept threshold=3) : 4
Adequacy of Citations (1-5, accept threshold=3) : 4
Reproducibility of Methods (1-5, accept threshold=3) : 4
Overall Recommendation (1-6) : 4

-- Comments to the author(s):
This paper presents a "non-metric" embedding approach for using item attribute information in recommendations. The need for a non-metric approach is well motivated in the introduction, the proposed methods address the need appropriately, state-of-the-art techniques are leveraged where possible, and the experiments are extensive and show positive results. I have several (relatively minor) suggestions, below.

Minor suggestions:
I think the discussion around equation 4 uses the terms "anchor space" and "support space" interchangeably. Or, I didn't understand the distinction if there is one.

Figure 1 does help to clarify the approach, but the part that is hard to understand from the figure is how the approach avoids making an item maximally related to itself. Where would the t-shirt be if it were embedded into visual space 2, for example?

I was concerned about transforming distance into probability in the manner shown in the Figure, where the probability of relatedness is proportional to the inverse of summed distance. But it turns out that the proportionality in the Figure isn't really what is done in the system (instead, it's equation 7, a standard sigmoid transformation of distance). I would change Figure 1 to reflect what is really done.

Figure 3 ends up being a little misleading. From the zoom-ins I might conclude that visual space 1 is shirts, space 2 is pants/shirts, and space 3 is shoes.
But that's not really the case; it's just that those local snippets of the space fall into those categories. Figure 4 helps clarify this.

The paper claims that its scalability is superior to similar previous work such as [5]; establishing this rigorously (theoretically or empirically) would improve the paper.

I think Figure 5 could be augmented to be fairer to previous work. If [20] was evaluated in a setting where the recommended items come from specific categories, showing at least one example in such a query setting might be helpful.

-- Summary:
This is a strong paper that presents a well-motivated technique. While not a creative breakthrough in modeling (instead, it puts together standard approaches like embeddings and mixtures of experts in a new way), it uses the right components in the right way to achieve good results.

---------- End of Review from Reviewer 4 ----------

------------- Review from Reviewer 5 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of Work (1-5, accept threshold=3) : 2
Technical Soundness (1-5, accept threshold=3) : 5
Quality of Presentation (1-5, accept threshold=3) : 5
Impact of Ideas or Results (1-5, accept threshold=3) : 5
Adequacy of Citations (1-5, accept threshold=3) : 5
Reproducibility of Methods (1-5, accept threshold=3) : 5
Overall Recommendation (1-6) : 3

-- Comments to the author(s):
The paper introduces the use of multiple feature spaces to establish relationships between heterogeneous data, in the context of item recommendation. This is a very well written paper that addresses a very important problem with applications in many areas relating to information retrieval, beyond recommender systems. The paper makes a particularly solid reference to related work, and proposes implementations and evaluation with both visual features and text features. The experiments seem very solid and the results are very clearly presented.
Unfortunately, despite all these qualities, the proposed model is not very novel, and neither is the presented application, so the contribution of the paper is quite incremental. Therefore it would be more appropriate as a short paper.

-- Summary:
In summary, the paper is of very high quality and addresses an important problem relating to many applications in information retrieval. However, the novelty is lacking to make it a strong contribution as a full paper for SIGIR.

---------- End of Review from Reviewer 5 ----------

------------- Review from Reviewer 6 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 3
Originality of Work (1-5, accept threshold=3) : 2
Technical Soundness (1-5, accept threshold=3) : 3
Quality of Presentation (1-5, accept threshold=3) : 4
Impact of Ideas or Results (1-5, accept threshold=3) : 3
Adequacy of Citations (1-5, accept threshold=3) : 4
Reproducibility of Methods (1-5, accept threshold=3) : 2
Overall Recommendation (1-6) : 3

-- Comments to the author(s):
This paper addresses the task of identifying related items based on visual features, by proposing probabilistic mixtures of embeddings based on the idea of mixtures of experts. The experiments were conducted on a dataset from Amazon. The paper is well organized in general. However, there are several issues. First of all, the novelty of the paper is limited given the prior work in [20]. The paper addresses the same task with a similar model, with an extension from a single embedding to multiple embeddings. The extension is not inspiring. Moreover, the paper lacks the details of model learning. It is unclear how the parameters are estimated; the authors only mention that the model is learned with L-BFGS. The paper should show the derivatives of the likelihood function with respect to the parameters so that others can follow and replicate the work. The paper pointed out that the proposed algorithm is efficient because the total number of parameters is O(F*N*K).
But in cases where N and K are large, there would be a huge number of parameters. In fact, the proposed model has significantly more parameters to estimate than traditional embedding methods such as LMT do. The experiments are not very convincing. The authors should conduct some analysis of the impact of the dimensionality of the latent space K and the number of embeddings N. It is not a fair comparison that K is set to 100 for LMT and 20 for Monomer without justification. Furthermore, the dataset was constructed so that the number of positive relationships equals the number of non-relationships. This seems artificial and does not reflect the real distribution, because in the real world two randomly picked items are most likely not related to each other. Thus, there should be many more non-relationships than positive relationships in the test data. In addition, the authors should use standard evaluation metrics such as Precision, Recall, or the F measure. At the least, they should clearly show how their evaluation metric, Error, is defined.

-- Summary:
This paper is well organized. However, the novelty of the work is limited given the prior work. More details on model learning need to be presented. The experiments are not convincing.

---------- End of Review from Reviewer 6 ----------
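[Editor's note] Reviewer 6's O(F*N*K) parameter-count concern can be made concrete with a back-of-the-envelope sketch. The embedding dimensions (K=100 for LMT, K=20 for Monomer) come from the reviews; the raw feature dimension F=4096 and the number of spaces N=5 are assumed values for illustration only, not figures from the paper.

```python
# Rough parameter counts: a single-embedding model learns one K x F
# projection, while an (assumed) anchor + N support spaces mixture
# learns N + 1 of them, i.e. O(F*N*K) parameters overall.
F = 4096                       # assumed raw visual feature dimension

def single_embedding_params(K, F=F):
    return K * F               # one K x F projection matrix

def mixture_params(N, K, F=F):
    return (N + 1) * K * F     # one anchor space plus N support spaces

lmt = single_embedding_params(K=100)   # 100 * 4096 = 409,600
monomer = mixture_params(N=5, K=20)    # 6 * 20 * 4096 = 491,520
```

Under these assumed dimensions the mixture does carry more parameters than the single embedding, which is the comparison the reviewer asks the authors to analyse for varying N and K.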