============================================================================
EMNLP-IJCNLP 2019 Reviews for Submission #518
============================================================================

Title: Justifying Recommendations using Distantly-Labeled Reviews and Fined-Grained Aspects

Authors: Jianmo Ni, Jiacheng Li and Julian McAuley

============================================================================
                                 META-REVIEW
============================================================================

Comments: The paper proposes a new dataset and task for recommendation
justification. The approach is well presented, and the authors addressed
the reviewers' concerns well.

============================================================================
                                 REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the
main strengths and weaknesses?
---------------------------------------------------------------------------
This paper introduces new datasets and methods for the task of
recommendation justification. It uses an extractive approach to identify
justification segments, and then proposes two personalized justification
generation models: a seq2seq model with aspect planning and an
aspect-conditional masked language model. Empirical results show that the
proposed models can generate good justifications.

Strengths:
- It introduces a dataset and task for recommendation justification.
- It uses an extractive approach to identify justification segments from
  reviews.
- It introduces a reference-based seq2seq model with aspect planning, as
  well as an aspect-conditional masked language model.

Weaknesses:
- BERT cannot be directly applied to generation. With the sampling
  technique of [Wang and Cho 2019], an initial sentence needs to be
  determined. How sensitive the generation quality is to the choice of
  initial sentence is not studied. The stopping condition (i.e., how many
  sampling iterations to run) is also not studied.
- The shapes of variables should be carefully examined throughout the
  paper. For instance, at line 311, the output of e = GRU(E) is a 3-d
  tensor of shape (l_s, l_r, n), where l_s seems to be the max length of
  justifications. Why is the projection matrix W of shape (l_r, 1)? Is
  this operation trying to compress the embedding output e along the
  #examples dimension (l_r) instead of the #seq_len dimension (l_s)?
- For the attention layer, it is confusing that, given that e is a 3-d
  tensor of shape (l_s, l_r, n) (line 311), a_t^1 = \sum_j alpha_{tj}^1 e_j
  at line 342 would be a vector of size (n, 1) rather than a matrix of
  shape (l_r, n). (See the shape sketch after this review's Reasons to
  accept.)
- There is extensive work on explainable recommendation, especially on
  generating personalized reviews, such as TransNets (Catherine and Cohen,
  2017), Ni et al. (2017), and Wang et al. (2018). The authors failed to
  include that line of work in the Related Work section, and did not
  compare their setting with those prior models.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
- A new dataset and a new task of recommendation justification.
- Two personalized generation models for generating recommendation
  justifications.
---------------------------------------------------------------------------
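To make the shape questions above concrete, the following is a minimal
sketch, assuming PyTorch-style GRU semantics. Here E, W, l_s, l_r, and n
follow the paper's notation; every other name is hypothetical.

    # Minimal sketch of the shape question, assuming PyTorch conventions:
    # GRU inputs/outputs are (seq_len, batch, features).
    import torch
    import torch.nn as nn

    l_s, l_r, n = 20, 5, 64  # max justification length, #references, hidden size

    E = torch.randn(l_s, l_r, n)        # embedded reference inputs
    gru = nn.GRU(input_size=n, hidden_size=n)
    e, _ = gru(E)                       # e: (l_s, l_r, n)

    # A projection W of shape (l_r, 1) collapses the #examples dimension
    # l_r, not the #seq_len dimension l_s:
    W = torch.randn(l_r, 1)
    compressed = (e.transpose(1, 2) @ W).squeeze(-1)  # (l_s, n)

    # If j indexes the sequence dimension, each e_j is an (l_r, n) matrix,
    # so a_t = sum_j alpha_tj * e_j comes out as (l_r, n), not (n, 1):
    alpha_t = torch.softmax(torch.randn(l_s), dim=0)
    a_t = torch.einsum("j,jrn->rn", alpha_t, e)       # (l_r, n)
    print(compressed.shape, a_t.shape)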
Reasons to reject
---------------------------------------------------------------------------
- Unclear descriptions of the notation, making it hard to understand the
  setting.
- Lack of a sensitivity check of the generation quality under different
  initial sentences, which is an essential step in this generation process.
  The stopping condition of how many iterations to run is also not studied.
- Failure to compare with similar explainable-recommendation baselines.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
                             Reviewer's Scores
---------------------------------------------------------------------------
                    Overall Recommendation: 3

Questions for the Author(s)
---------------------------------------------------------------------------
- It would be great if the authors could provide brief explanations of the
  measures "Distinct-1" and "Distinct-2". What is the inter-annotator
  agreement among the human annotations on the three aspects: relevance,
  informativeness, and diversity?
- Equation 8 needs to be better defined.
- Table 1 is a bit confusing - both the tip and the justification seem to
  mention aspects that do not exist in the reviews. Does the generated
  justification need to be covered by the reviews, or is it okay to make up
  some aspects?
- What is the definition of a fine-grained aspect a? A vector? A matrix?
---------------------------------------------------------------------------

Missing References
---------------------------------------------------------------------------
- Catherine, Rose, and William Cohen. "TransNets: Learning to transform for
  recommendation." In Proceedings of the Eleventh ACM Conference on
  Recommender Systems, pp. 288-296. ACM, 2017.
- Ni, Jianmo, Zachary C. Lipton, Sharad Vikram, and Julian McAuley.
  "Estimating reactions and recommending products with generative models of
  reviews." In Proceedings of the Eighth International Joint Conference on
  Natural Language Processing (Volume 1: Long Papers), pp. 783-791. 2017.
- Wang, Xiang, Xiangnan He, Fuli Feng, Liqiang Nie, and Tat-Seng Chua.
  "TEM: Tree-enhanced embedding model for explainable recommendation." In
  Proceedings of the 2018 World Wide Web Conference, pp. 1543-1552.
  International World Wide Web Conferences Steering Committee, 2018.
---------------------------------------------------------------------------

============================================================================
                                 REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the
main strengths and weaknesses?
---------------------------------------------------------------------------
The paper introduces a dataset consisting of justifications for
recommendations, as well as two generation models: a seq2seq
aspect-planning model and an aspect-conditional masked language model.

The main strength of the paper is the dataset of justifications for
recommendations. The evaluation is sound: it uses two datasets and includes
a qualitative analysis.

One weakness I found is that the related work seems a bit superficial.
Additionally, why are segments containing first-person pronouns discarded?
They may contain justifications, e.g., "I have found the food to be
extremely good".
---------------------------------------------------------------------------
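To make the pronoun-filtering concern above concrete, the following is a
minimal sketch of the kind of first-person filter being questioned. The
pronoun list and function name are hypothetical, not taken from the paper.

    # Hypothetical first-person-pronoun filter; segments containing any
    # first-person token are discarded.
    import re

    FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

    def keep_segment(segment: str) -> bool:
        """Return False for segments containing a first-person pronoun."""
        tokens = re.findall(r"[a-z']+", segment.lower())
        return not any(tok in FIRST_PERSON for tok in tokens)

    # The reviewer's example is dropped even though it reads like a valid
    # justification:
    print(keep_segment("I have found the food to be extremely good"))  # False
    print(keep_segment("The food is extremely good"))                  # True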
Reasons to accept
---------------------------------------------------------------------------
A dataset consisting of justifications for recommendations.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
                             Reviewer's Scores
---------------------------------------------------------------------------
                    Overall Recommendation: 4

Questions for the Author(s)
---------------------------------------------------------------------------
The examples in Table 1: are the tips and justifications related to the
reviews? They contain aspects not mentioned in the review. How does this
work?
---------------------------------------------------------------------------

Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
line 95: "only a few e-commerce systems provides" -> "provide"
footnote 2: typo, singular
line 267: "an user persona" -> "a user persona"
footnote 3: "to the prevent" -> "to prevent"
---------------------------------------------------------------------------

============================================================================
                                 REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the
main strengths and weaknesses?
---------------------------------------------------------------------------
The paper addresses the issue of recommendation justification, which is
about providing a rationale for why a given product or business is
recommended. For this, the authors present a Yelp dataset annotated with
whether a recommendation justification is good or bad. They also present a
pipeline to identify candidate recommendation justifications from existing
reviews, along with two models to generate recommendation justifications.

It is an interesting work tackling a useful problem. Experiments are
conducted on several models, with both automatic and manual evaluation.
However, important details about the annotation process are missing. First,
it is not clear how many segments were used to compute the inter-annotator
agreement; it should be computed on all data annotated after the iterative
process is completed. Second, what is meant by "Cohen's kappa [...] after
alignment"? What is there to align if annotators are given segmented text
to label? Lastly, given that Cohen's kappa is used, I presume there were
two annotators. Then how were the final labels determined when a segment
was labeled inconsistently? A common practice is to have an independent
adjudicator, but there is no mention of one. (See the kappa sketch after
the Reasons to reject below.)
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
Extracting the rationale behind recommendations can itself be very valuable
to members of industry (and academia). It is a step toward NLP systems
providing meaningful explanations for their output.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
The annotation process is not well documented and may have been conducted
in a technically unsound manner.
---------------------------------------------------------------------------
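For reference on the agreement discussion above, the following is a minimal
sketch of Cohen's kappa for two annotators labeling the same pre-segmented
items. The labels and counts are illustrative, not taken from the paper.

    # Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e), where p_o
    # is the observed agreement and p_e the agreement expected by chance.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of items labeled identically.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each annotator's label marginals.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(freq_a[l] * freq_b[l]
                  for l in freq_a.keys() | freq_b.keys()) / n ** 2
        return (p_o - p_e) / (1 - p_e)

    ann1 = ["good", "good", "bad", "good", "bad", "bad"]
    ann2 = ["good", "bad", "bad", "good", "bad", "good"]
    print(round(cohens_kappa(ann1, ann2), 3))  # 0.333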
---------------------------------------------------------------------------
                             Reviewer's Scores
---------------------------------------------------------------------------
                    Overall Recommendation: 3.5

Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
"fined-grained" --> "fine-grained"
---------------------------------------------------------------------------