ICDM 2017 (accept):
==========================================================================
--======== Review Reports ========--

The review report from reviewer #1:

*1: Is the paper relevant to ICDM? [_] No [X] Yes
*2: How innovative is the paper? [_] 6 (Very innovative) [X] 3 (Innovative) [_] -2 (Marginally) [_] -4 (Not very much) [_] -6 (Not at all)
*3: How would you rate the technical quality of the paper? [_] 6 (Very high) [X] 3 (High) [_] -2 (Marginal) [_] -4 (Low) [_] -6 (Very low)
*4: How is the presentation? [X] 6 (Excellent) [_] 3 (Good) [_] -2 (Marginal) [_] -4 (Below average) [_] -6 (Poor)
*5: Is the paper of interest to ICDM users and practitioners? [X] 3 (Yes) [_] 2 (May be) [_] 1 (No) [_] 0 (Not applicable)
*6: What is your confidence in your review of this paper? [X] 2 (High) [_] 1 (Medium) [_] 0 (Low)
*7: Overall recommendation [X] 6: must accept (in top 25% of ICDM accepted papers) [_] 3: should accept (in top 80% of ICDM accepted papers) [_] -2: marginal (in bottom 20% of ICDM accepted papers) [_] -4: should reject (below acceptance bar) [_] -6: must reject (unacceptable: too weak, incomplete, or wrong)
*8: Summary of the paper's main contribution and impact
The authors propose a method to recommend visual images of clothes. Two types of recommendation are proposed: one selects existing images that fit the user's taste; the other generates new images that the user would be likely to prefer.
*9: Justification of your recommendation
The techniques used in this system are not new, but the system is well designed through the selection and combination of appropriate techniques. I consider this type of system to be a new direction for recommendation.
*10: Three strong points of this paper (please number each point)
1) A well-designed system.
2) Excellent presentation.
*11: Three weak points of this paper (please number each point)
1) Unclear motivation for recommending imaginary items generated by a GAN.
*12: Is this submission among the best 10% of submissions that you reviewed for ICDM'17? [_] No [X] Yes
*13: Would you be able to replicate the results based on the information given in the paper? [_] No [X] Yes
*14: Are the data and implementations publicly available for possible replication? [_] No [X] Yes
*15: If the paper is accepted, which format would you suggest? [X] Regular Paper [_] Short Paper
*16: Detailed comments for the authors
### If this system were used on an e-commerce site, customers would not be able to buy the clothes or accessories generated by a GAN. Do you have any thoughts on this point? I think it would be interesting to generate new coordinated outfits that users would prefer.
### GANs sometimes generate collapsed images. This property is bad for recommendation, because such recommendations erode users' trust in the system. Do you have any ideas for avoiding bad images?
### MINOR COMMENTS
- Equation numbers should be written with parentheses.

========================================================
The review report from reviewer #2:

*1: Is the paper relevant to ICDM? [_] No [X] Yes
*2: How innovative is the paper? [_] 6 (Very innovative) [X] 3 (Innovative) [_] -2 (Marginally) [_] -4 (Not very much) [_] -6 (Not at all)
*3: How would you rate the technical quality of the paper? [_] 6 (Very high) [X] 3 (High) [_] -2 (Marginal) [_] -4 (Low) [_] -6 (Very low)
*4: How is the presentation? [X] 6 (Excellent) [_] 3 (Good) [_] -2 (Marginal) [_] -4 (Below average) [_] -6 (Poor)
*5: Is the paper of interest to ICDM users and practitioners? [_] 3 (Yes) [X] 2 (May be) [_] 1 (No) [_] 0 (Not applicable)
*6: What is your confidence in your review of this paper? [_] 2 (High) [X] 1 (Medium) [_] 0 (Low)
*7: Overall recommendation [_] 6: must accept (in top 25% of ICDM accepted papers) [X] 3: should accept (in top 80% of ICDM accepted papers) [_] -2: marginal (in bottom 20% of ICDM accepted papers) [_] -4: should reject (below acceptance bar) [_] -6: must reject (unacceptable: too weak, incomplete, or wrong)
*8: Summary of the paper's main contribution and impact
The goal of this paper is to build a fashion recommendation system and a generative image model that personalizes designs based on user preferences. The authors build a visual recommender by combining a Siamese CNN with Bayesian Personalized Ranking (BPR). In BPR, the recommendation problem is formulated as maximizing the difference in preference scores between items the user has expressed interest in and items the user has not (i.e., the preference scores of items the user is interested in should be larger than those of items that do not interest the user). The main contribution is that, instead of using pre-extracted CNN feature representations with an embedding matrix as part of the preference score (besides bias and offset terms), they use the Siamese CNN to extract visual features directly from the images themselves. In the optimization objective, the preference-score differences therefore involve the weights of the Siamese CNN rather than an embedding matrix applied to fixed, pre-extracted image features (a schematic form of this objective is sketched after *10 below). The second contribution is combining a generative adversarial network (GAN) with the previously trained model to build a generative system that synthesizes images by maximizing the preference score of the GAN's output.
*9: Justification of your recommendation
Beyond the idea of combining the Siamese CNN and BPR to directly optimize preference-score differences, the second proposed model, which feeds GAN outputs into the proposed image model to generate new and plausible images for designers, is interesting, and the quality of the resulting images is quite good. The experiments include not only matrix factorization (MF) but also traditional BPR. The authors compare MF against the approaches with visual features, and further compare models using pre-trained features against models whose image features are learned for the recommendation task. The experiments are thorough and comprehensive.
*10: Three strong points of this paper (please number each point)
1) The experiments are rich in content and logically organized. The first part compares MF methods against MP; the authors then compare methods with and without visual features; the last part evaluates their proposed idea of learning the visual representation directly from preferences. In Fig. 4 and Fig. 5, the authors give examples of what the generated images look like, which answers the reader's curiosity about how plausible the synthesized images are.
2) The problem formulation and optimization objectives are well written, leading readers through the authors' logic: from how learning image features directly from preferences can improve performance, to how the proposed approaches outperform traditional methods built on pre-trained visual features.
3) In the image-generation experiments, the authors study the impact of the hyperparameter η, which controls the trade-off between image quality and preference score. Image quality is essential for image generation, especially in a fashion-design task. The authors use figures to clearly show how they choose this trade-off parameter.
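For concreteness, the pairwise objective described in *8 can be written schematically as follows; this is a reconstruction from the review text, not the paper's exact notation:

    max_{θ, W}  Σ_{(u,i,j) ∈ D}  ln σ(x_{u,i} − x_{u,j}),    with  x_{u,i} = θ_u^T Φ(X_i; W)

Here Φ(·; W) is the Siamese CNN applied directly to the raw image X_i of item i, θ_u is user u's visual-preference vector, σ is the logistic sigmoid, and D contains triplets (u, i, j) in which user u interacted with item i but not with item j. The CNN weights W are learned jointly with θ_u, instead of fixing pre-extracted features and learning only an embedding matrix.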
*11: Three weak points of this paper (please number each point)
1) In Equation (3), the reasons for excluding the bias terms and non-visual latent factors are not clear. The authors suggest that the remaining terms are empirically sufficient, but if the bias terms were insignificant they would simply be assigned small values during training. The authors should give experimental results or further explanation of why omitting these terms improves overall performance.
2) Running times should be included. Since all experiments were conducted on a workstation, it would be better for the authors to give a simple running-time comparison between the best model and the second best. Because the features are no longer pre-extracted, the potential trade-off between running time and performance could then be discussed.
3) The parameters of the proposed method should be elaborated more clearly. In Section IV, the baseline subsection lists all methods used in the experiments. For the proposed method, the authors use a single parameter setting for all datasets, claiming that the method is not sensitive to its parameters. The authors should discuss the parameters used for DVBPR and run a grid search on all datasets. The minibatch size is also not explained in the paper.
*12: Is this submission among the best 10% of submissions that you reviewed for ICDM'17? [X] No [_] Yes
*13: Would you be able to replicate the results based on the information given in the paper? [_] No [X] Yes
*14: Are the data and implementations publicly available for possible replication? [X] No [_] Yes
*15: If the paper is accepted, which format would you suggest? [X] Regular Paper [_] Short Paper
*16: Detailed comments for the authors
1) Table II lists the datasets, and Table IV shows evaluation results for images from four sources. However, the number of categories in the Tradesy.com dataset is not specified. In image generation, categories are given as inputs, so for experimental reproducibility the authors should give more detail on how their algorithm handles a dataset without categories.
2) Some minor errors should be corrected. For example, in Equation (9) the last expression is the output of the GAN model, which is not equal to the first line, the preference score. In the baselines, "MF methods vs. MP" is brought up without any description of what MP is.
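The image-generation step discussed in reviewer #2's strong point 3 — searching for images that maximize a user's preference score while η keeps them plausible — can be sketched as gradient ascent over the GAN's latent code. This is a minimal, hypothetical PyTorch sketch, not the authors' implementation: G, phi, and theta_u stand for the trained generator, the Siamese CNN, and a user's preference vector, and the L2 penalty on z is only a simple stand-in for the paper's image-quality term weighted by η.

    import torch

    def personalized_design(G, phi, theta_u, eta=1.0, steps=200, lr=0.05, latent_dim=100):
        """Search the GAN latent space for an image with a high preference
        score for user u, penalized by eta to keep the image plausible.
        G, phi, theta_u are assumed pre-trained; all names are illustrative."""
        z = torch.randn(1, latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            x = G(z)                                     # candidate image
            score = (theta_u * phi(x).flatten()).sum()   # preference score x_{u,G(z)}
            penalty = (z ** 2).sum()                     # stand-in quality/realism term
            loss = -(score - eta * penalty)              # ascend score, stay near the prior
            opt.zero_grad()
            loss.backward()
            opt.step()
        return G(z).detach()

A larger η keeps z close to the Gaussian prior (more typical, higher-quality images), while a smaller η pushes harder on the preference score — exactly the trade-off the reviewer highlights.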
========================================================

The review report from reviewer #3:

*1: Is the paper relevant to ICDM? [_] No [X] Yes
*2: How innovative is the paper? [_] 6 (Very innovative) [X] 3 (Innovative) [_] -2 (Marginally) [_] -4 (Not very much) [_] -6 (Not at all)
*3: How would you rate the technical quality of the paper? [_] 6 (Very high) [X] 3 (High) [_] -2 (Marginal) [_] -4 (Low) [_] -6 (Very low)
*4: How is the presentation? [_] 6 (Excellent) [X] 3 (Good) [_] -2 (Marginal) [_] -4 (Below average) [_] -6 (Poor)
*5: Is the paper of interest to ICDM users and practitioners? [_] 3 (Yes) [X] 2 (May be) [_] 1 (No) [_] 0 (Not applicable)
*6: What is your confidence in your review of this paper? [_] 2 (High) [X] 1 (Medium) [_] 0 (Low)
*7: Overall recommendation [_] 6: must accept (in top 25% of ICDM accepted papers) [X] 3: should accept (in top 80% of ICDM accepted papers) [_] -2: marginal (in bottom 20% of ICDM accepted papers) [_] -4: should reject (below acceptance bar) [_] -6: must reject (unacceptable: too weak, incomplete, or wrong)
*8: Summary of the paper's main contribution and impact
This paper presents a system for fashion recommendation that is capable not only of suggesting existing items to a user, but also of generating new, plausible fashion images that match user preferences.
*9: Justification of your recommendation
Accept: an interesting application of generative image models to fashion recommendation and design with visual signals.
*10: Three strong points of this paper (please number each point)
1. An interesting application for recommendation methodology.
2. A novel proposal for a visually-aware recommender system.
3. Multiple metrics are proposed to measure the system both quantitatively and qualitatively.
*11: Three weak points of this paper (please number each point)
None given.
*12: Is this submission among the best 10% of submissions that you reviewed for ICDM'17? [X] No [_] Yes
*13: Would you be able to replicate the results based on the information given in the paper? [X] No [_] Yes
*14: Are the data and implementations publicly available for possible replication? [X] No [_] Yes
*15: If the paper is accepted, which format would you suggest? [_] Regular Paper [X] Short Paper
*16: Detailed comments for the authors
None given.

========================================================

SIGIR 2017 (reject):
----------------------- REVIEW 1 ---------------------
PAPER: 27
TITLE: Visually-Aware Fashion Recommendation and Design with Generative Image Models
AUTHORS: Wang-Cheng Kang, Chen Fang, Zhaowen Wang and Julian McAuley

Relevance to SIGIR: 5 (excellent)
Originality of Work: 4 (good)
Technical Soundness: 4 (good)
Quality of Presentation: 3 (fair)
Impact of Ideas or Results: 4 (good)
Adequacy of Citations: 5 (excellent)
Reproducibility of Methods: 3 (fair)
Overall Recommendation: 1 (weak accept)

----------- Comments to the Author(s) -----------
While previous works have used pre-trained CNN features, the authors propose to use Generative Adversarial Networks for personalized recommendation and design of fashion items. Although the approach is interesting, the contribution is over-advertised in the abstract and introduction: the qualitative results indicate that the generated fashion items do not diverge from their closest items in the training set as much as the abstract and introduction lead you to expect. My primary concern is that the amount of novelty is limited; essentially, the authors propose to do fine-tuning instead of using pre-trained features. However, the performance improvements are significant. There are a few minor concerns too:
- Last sentence on page 2: “hence can’t be distinguished from” => replace “can’t” with “cannot”.
- In Section 3.1: “Unlike VBPR, we do not include ….” => the notation VBPR is not previously defined.
- In Section 4.1: “Our data set was first introduced…” => this is a poor word choice; change to: “The data set that we use was first introduced …”.
- In Section 4.1: “we always report performance for the model that achieved the best performance on our validation set.” => this phrase is ambiguous: “we report performance … on the validation set” or “the best performance on our validation set”? To clear things up, I would say “we always report performance on the test set for the model that achieved the best performance on our validation set.”
- After Table 2: “We evaluate the AUC” => change to “We use / compute the AUC”.
- Why do you sample 12,800 triplets? How did you come up with this number? Why not use a round number of triplets, e.g., 10K or 15K?
- In Table 3: is the difference significant from (h) to (g), or from (h) to what? The headers of the last two columns appear to be wrong; I think the authors meant “h vs e” and “h vs best”, respectively.

----------- Summary -----------
I think the paper can be accepted for publication, but the authors should improve the quality of their presentation and revise the over-statements in the abstract and introduction.
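On the sampled-AUC question raised above: estimating AUC by sampling triplets is standard practice, and the Monte Carlo error shrinks with the sample size. A minimal sketch of such a protocol follows (hypothetical names, not the authors' code; in practice sampled negatives should also exclude items the user has interacted with):

    import numpy as np

    def sampled_auc(score, test_pairs, num_items, n_samples=12800, seed=0):
        """Estimate AUC as the probability that a held-out positive item
        outscores a randomly drawn item for the same user. `score(u, i)`
        is any trained preference model; `test_pairs` holds (user, pos_item)."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(n_samples):
            u, i = test_pairs[rng.integers(len(test_pairs))]  # held-out positive
            j = rng.integers(num_items)                       # sampled negative
            hits += score(u, i) > score(u, j)
        return hits / n_samples

With 12,800 samples the standard error of such an estimate is at most sqrt(0.25 / 12800) ≈ 0.004, so AUC differences of a point or more are unlikely to be sampling noise; still, the reviewer's request that the paper justify the number stands.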
----------------------- REVIEW 2 ---------------------
PAPER: 27
TITLE: Visually-Aware Fashion Recommendation and Design with Generative Image Models
AUTHORS: Wang-Cheng Kang, Chen Fang, Zhaowen Wang and Julian McAuley

Relevance to SIGIR: 3 (fair)
Originality of Work: 4 (good)
Technical Soundness: 2 (poor)
Quality of Presentation: 4 (good)
Impact of Ideas or Results: 4 (good)
Adequacy of Citations: 4 (good)
Reproducibility of Methods: 3 (fair)
Overall Recommendation: 0 (borderline)

----------- Comments to the Author(s) -----------
The paper consists of two main parts. The basic idea of the first part is to provide image-based recommendation using an end-to-end CNN that starts from the original image pixels (instead of pre-trained feature vectors); the basic idea of the second part is to use a GAN to generate visually plausible images for users or user groups. I believe the research topic is worth pursuing, although there have recently been many works trying to apply GANs owing to their popularity. However, I have the following concerns about the model and (most importantly) about the experimental evaluation.

The authors provide only very limited quantitative results, although it should be possible to evaluate quantitatively what is currently presented as qualitative evaluation. For example, the authors claim throughout the paper that the model can generate plausible and diverse images, but it is never evaluated how "plausible" and "diverse" they are. The authors should have evaluated plausibility (e.g., by visual similarity) and diversity (e.g., by average mutual distance) to provide quantitative support for these claims.

The "quantitative evaluation" protocol for objective-value comparison in Section 4.3 can be very misleading to readers. In this experiment, the authors use their own objective function as the evaluation metric, which looks odd. This means that the results shown in Figure 4 do not actually say anything about how well the generated images match users' preferences, because the optimization algorithm will by nature improve the objective value during optimization, and the generated images will of course obtain higher objective values. A more reasonable evaluation protocol for this task would be to hold out some images for each user (e.g., images of their latest purchases) and then evaluate whether the generated images are similar to these test images.

Also in Section 4.3, the authors did not provide results for the MF-based methods. Although the authors say that MF-based approaches were unable to generate satisfactory results, it would be beneficial to at least provide the numbers in Table 3, because these methods are indeed runnable on the experimental dataset (fewer than 5 items is actually a lot), and their AUC scores will not be unreasonably small (i.e., above the 0.5 of the random baseline).

Some other issues: For the computation of AUC, the authors randomly sampled 12,800 triplets for evaluation; how and why was this number selected, and how do different selections affect the AUC? The authors say that all the baselines are their own implementations but cite the MyMediaLite implementation for the MM-MF method, which is inconsistent. Besides, in Equation (5) it is not clear what the parameters Θ really are; the authors only state that they are all the model parameters. These should be presented explicitly for reproducibility.

Minor issues: Section 1 includes a single subsection, 1.1. In Table 3, the improvement columns should read "h vs. e" and "h vs. best".

----------- Summary -----------
The general ideas of the paper (i.e., end-to-end CNN for image recommendation and generating images for users) are interesting and worth pursuing, but the current form of the paper has severe evaluation issues: 1) the quantitative evaluation is very limited and unpersuasive (it uses the objective function as the evaluation measure), and 2) the qualitative evaluation is not supported by statistics (e.g., image similarity or distance).
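The held-out-image protocol and the plausibility/diversity statistics that REVIEW 2 asks for could be computed along the following lines. This is a hypothetical sketch, not anything from the paper: `feat` stands for any fixed image-feature extractor (e.g., a pre-trained CNN returning a float vector), and all names are illustrative.

    import numpy as np

    def plausibility(generated, held_out, feat):
        """Mean cosine similarity between each generated image and its most
        similar held-out image for the same user (higher = closer to the
        user's true preferences)."""
        G = np.stack([feat(x) for x in generated])
        H = np.stack([feat(x) for x in held_out])
        G /= np.linalg.norm(G, axis=1, keepdims=True)
        H /= np.linalg.norm(H, axis=1, keepdims=True)
        return (G @ H.T).max(axis=1).mean()

    def diversity(generated, feat):
        """Average mutual (pairwise) feature distance among generated images,
        as the reviewer suggests for quantifying diversity."""
        F = np.stack([feat(x) for x in generated])
        d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
        n = len(F)
        return d.sum() / (n * (n - 1))  # the zero diagonal drops out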
----------------------- REVIEW 3 ---------------------
PAPER: 27
TITLE: Visually-Aware Fashion Recommendation and Design with Generative Image Models
AUTHORS: Wang-Cheng Kang, Chen Fang, Zhaowen Wang and Julian McAuley

Relevance to SIGIR: 4 (good)
Originality of Work: 3 (fair)
Technical Soundness: 4 (good)
Quality of Presentation: 4 (good)
Impact of Ideas or Results: 3 (fair)
Adequacy of Citations: 4 (good)
Reproducibility of Methods: 4 (good)
Overall Recommendation: 0 (borderline)

----------- Comments to the Author(s) -----------
This paper proposes a fashion recommendation and design system using a Siamese CNN and a Generative Adversarial Network. In particular, a pairwise training method extended from VBPR [15] is proposed for recommendation, and a basic GAN is further exploited on the fashion dataset for novel fashion image generation.

Strengths:
1. The problem of personalized fashion image generation is novel and interesting.
2. Well written and mostly clear.

Weaknesses:
1. The novelty of this paper is somewhat limited. The recommendation part seems to be a simple combination of VBPR [15] and a Siamese CNN [43] for fashion recommendation. For the fashion-design part, it is interesting to utilize a GAN for personalized fashion image generation; however, the adopted GAN architecture is similar to DCGAN [36] without any explicit technical improvement.
2. The fashion-recommendation experiments could be improved. The authors should include more recent state-of-the-art methods for performance comparison (e.g., [13], [14], and [A]).
[A] He, Ruining, et al. "Sherlock: Sparse Hierarchical Embeddings for Visually-Aware One-Class Collaborative Filtering." In IJCAI 2016.
3. Although the authors adopt an objective evaluation metric (the objective value) for the evaluation of fashion image generation, a human study is necessary to convince readers of the quality of the generated images.

----------- Summary -----------
The problem of fashion image generation addressed in this paper is interesting. However, the whole system seems to be a combination of previous works, and the novelty is somewhat limited. Besides, the authors should include more state-of-the-art baselines and a human study for evaluation.

----------------------- REVIEW 4 ---------------------
PAPER: 27
TITLE: Visually-Aware Fashion Recommendation and Design with Generative Image Models
AUTHORS: Wang-Cheng Kang, Chen Fang, Zhaowen Wang and Julian McAuley

Relevance to SIGIR: 5 (excellent)
Originality of Work: 3 (fair)
Technical Soundness: 4 (good)
Quality of Presentation: 3 (fair)
Impact of Ideas or Results: 3 (fair)
Adequacy of Citations: 4 (good)
Reproducibility of Methods: 4 (good)
Overall Recommendation: -1 (weak reject)

----------- Comments to the Author(s) -----------
This paper contains two related pieces of work on fashion recommendation. The first part focuses on developing a content-based recommender system tailored to visual features. The idea is to construct a matrix factorization model in which items are represented by visual features, using deep neural networks to build a mapping from the original feature space to the latent feature space; training is conducted with a pairwise method. In the second part, the work is extended by generating images that maximize the estimate from a given user model (also given a category); the idea is to generate novel images that do not appear in the dataset. The experiments are conducted on Amazon fashion datasets covering 5 categories of fashion goods with user ratings. Performance gains are observed in comparison with a few baselines, and the generated images are evaluated by comparing their average objective value with that obtained from real images.

While the idea is timely, the paper has the following weaknesses:
1) Compared with an existing paper (https://cseweb.ucsd.edu/~jmcauley/pdfs/aaai16.pdf), the novelty of the work presented here is questionable. The authors need to provide a detailed discussion of the differences. The solution appears incremental if the novelty is limited to moving from pre-trained features to trained features. In addition, embedding content features into a matrix factorization model has been heavily studied in the past, for instance in factorization machines; the difference is worth discussing.
2) The evaluation of the generative model needs more work. How does it compare to a conditional GAN that takes user preference as a given query? Also, the generated images are very small, so their practical usefulness is unclear.
3) The way the GAN is used looks trivial. It would be more interesting to develop a model in which the GAN is properly trained utilizing user preferences. More principled approaches can be found in conditional GANs or recent work on pix2pix GANs.
4) The presentation of the paper requires significant improvement. The paper covers two ideas that are in fact relatively independent; neither contains enough novelty on its own to earn acceptance, and writing them as a single paper also loses focus.
----------- Summary -----------
This is a borderline paper, but I vote for rejection for the following reasons: 1) the presentation: two relatively independent ideas, neither of which contains significant novelty; 2) the novelty of the proposed recommender model is small compared to [15]; 3) for image generation, some important work on generative models, such as conditional GANs, is missing.

------------------------- METAREVIEW ------------------------
PAPER: 27
TITLE: Visually-Aware Fashion Recommendation and Design with Generative Image Models

This is the meta-review. The paper proposes CNN-based image recommendation with learned visual features instead of pre-trained features, and then applies a GAN to fashion image generation. The topic is relevant to SIGIR. All the reviewers agree that the second problem, i.e., fashion image generation, is interesting and novel, and the paper is easy to follow. However, there are major concerns that cannot be neglected, summarized as follows:
1. The novelty of the paper is not well presented. Given highly relevant previous work, such as VBPR [15], DCGAN [36], and conditional GANs for image generation, more significant differences are expected to establish novelty. Correspondingly, it would be better to make a comparative study against more state-of-the-art baseline approaches.
2. The experimental evaluation should be improved. Comparing performance in terms of the proposed objective function is not convincing; user-preference-based evaluation might be a good idea.
3. It is recommended to reorganize the two parts of the paper so that they are organically connected.
For more details and other comments on presentation and reproducibility issues, please refer to the individual reviews.