============================================================================
ACL 2018 Reviews for Submission #527
============================================================================

Title: Personalized Review Generation By Expanding Phrases and Attending on Aspect-Aware Representations

Authors: Jianmo Ni and Julian McAuley

============================================================================
REVIEWER #1
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness: Appropriate
Adhere to ACL 2018 Guidelines: Yes
Adhere to ACL Author Guidelines: Yes
Handling of Data / Resources: Yes
Handling of Human Participants: N/A

Summary and Contributions
---------------------------------------------------------------------------
Summary: This paper proposes an architecture that generates a full review from some seed phrases, attributes of the user and of the product being reviewed, and aspects of the product linked to the user's preferences. The paper is well designed and written. It contains a detailed description of the model and an experimental evaluation that considers three baselines, an existing system from the literature, and several flavours of the proposed model.

Contribution 1: A review-generation model that takes richer input about the user's writing style and the product attributes they are interested in, leading to more personalised reviews.

Contribution 2: Experiments showing that the proposed model performs better than relevant baselines and another system from the literature.

Strengths
---------------------------------------------------------------------------
Strength argument 1: An innovative model for expanding notes into full reviews that can take additional evidence about the user's preferences and the product's attributes as input.

Strength argument 2: The model is precisely defined and discussed in detail.

Strength argument 3: The evaluation shows that the model can outperform the baselines and another system from the literature.

Weaknesses
---------------------------------------------------------------------------
Weakness argument 1: The model is introduced without sufficient explanation of why the chosen components fit the purpose better than other available alternatives.

Weakness argument 2: The evaluation considers only one relevant existing system.
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
NLP Tasks / Applications: Moderate contribution
Methods / Algorithms: Moderate contribution
Theoretical / Algorithmic Results: N/A
Empirical Results: Marginal contribution
Data / Resources: N/A
Software / Systems: N/A
Evaluation Methods / Metrics: N/A
Other Contributions: N/A
Originality (1-5): 2
Soundness/Correctness (1-5): 4
Substance (1-5): 4
Replicability (1-5): 3
Meaningful Comparison (1-5): 3
Readability (1-5): 4
Overall Score (1-6): 4

Additional Comments (Optional)
---------------------------------------------------------------------------
- In Section 3, L is undefined.

============================================================================
REVIEWER #2
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness: Appropriate
Adhere to ACL 2018 Guidelines: Yes
Adhere to ACL Author Guidelines: Yes
Handling of Data / Resources: N/A
Handling of Human Participants: N/A

Summary and Contributions
---------------------------------------------------------------------------
Summary: The paper describes an approach to generating user reviews from review summaries; the model is sensitive to aspects relevant to the item being reviewed, and to user-aspect and item-aspect relationships. An evaluation shows encouraging results on standard metrics, with improvements over baselines and competing models.

Contribution 1: A novel model that combines three encoders (for distinct kinds of knowledge) and one decoder with an attention fusion layer.

Strengths
---------------------------------------------------------------------------
Strength argument 1: An evaluation using BLEU and other metrics that shows considerable improvements for the new (ExpansionNet) models over baselines.

Strength argument 2: A demonstration that integrating title, attribute, and aspect information improves the quality of ExpansionNet's generated reviews, as measured by BLEU and related scores.

Strength argument 3: A compact but informative set of evaluations and examples of system output.

Weaknesses
---------------------------------------------------------------------------
Weakness argument 1: The paper criticises Attr2Seq for generating spurious aspects, but the final ExpansionNet+attribute&aspect model also does this in the Figure 1 example: the sentence "i have not tried the tablet app yet but i do n't have any problems with it" seems to conflict with the review summary phrase "nice standard apps".

Questions to Authors (Optional)
---------------------------------------------------------------------------
Question 1: It is not clear how the figures in Table 2 were arrived at. The test set contains 99K reviews, so presumably the numbers of aspects generated were not counted manually; how were they counted?
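For concreteness, something along the following lines is what I would guess (a minimal sketch only; the aspect lexicon and helper function below are invented for illustration and are not taken from the paper):

    # Hypothetical scheme for counting aspect mentions in generated
    # reviews automatically: match each review's tokens against a
    # fixed lexicon of aspect terms, then average over reviews.
    ASPECT_TERMS = {"battery", "screen", "keyboard", "apps", "price"}

    def count_aspect_mentions(review: str) -> int:
        # Number of distinct aspect terms appearing in one review.
        tokens = set(review.lower().split())
        return len(tokens & ASPECT_TERMS)

    generated = ["the battery is great and the screen is sharp .",
                 "works fine , good price ."]
    mean_aspects = sum(count_aspect_mentions(r) for r in generated) / len(generated)
    print(mean_aspects)  # 1.5 on this toy pair of reviews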
Question 2: Are the figures in Table 2 arithmetic means per review? This is not made clear.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
NLP Tasks / Applications: Marginal contribution
Methods / Algorithms: Strong contribution
Theoretical / Algorithmic Results: N/A
Empirical Results: Moderate contribution
Data / Resources: N/A
Software / Systems: N/A
Evaluation Methods / Metrics: N/A
Other Contributions: N/A
Originality (1-5): 4
Soundness/Correctness (1-5): 4
Substance (1-5): 4
Replicability (1-5): 4
Meaningful Comparison (1-5): 4
Readability (1-5): 5
Overall Score (1-6): 5

Additional Comments (Optional)
---------------------------------------------------------------------------
line 278: NLTK is not from Stanford. Do you mean the tokenizer in Stanford CoreNLP?
l.282: How is the sparsity calculated?
l.287: "ExpansionNet use" is a typo.
l.323: The yellow highlights disappear completely when the paper is printed in B&W.
l.331: "attributes. (Dong et al." is a typo.
l.335: As more input information is added, the model obtains better results on every metric except ROUGE-L. This exception should be mentioned.

============================================================================
REVIEWER #3
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Appropriateness: Appropriate
Adhere to ACL 2018 Guidelines: Yes
Adhere to ACL Author Guidelines: Yes
Handling of Data / Resources: Yes
Handling of Human Participants: N/A

Summary and Contributions
---------------------------------------------------------------------------
Summary: The paper describes an NN-based model for the generation of (product) reviews. The model takes some keywords/phrases, the product name, and some aspect terms (in the sense of opinion mining) as input and generates a review drawing upon a given dataset (a collection of reviews with their aspects).

Contribution 1: The main contribution of the paper is a three-encoder, one-decoder model that allows different types of input information to be taken into account.

Strengths
---------------------------------------------------------------------------
Strength argument 1: The main strength (and novelty) of the described work is that it allows external textual information to be incorporated into the generation process.

Strength argument 2: The examples of generated reviews show that the achieved quality is rather high, especially when attributes and aspects are taken into account.

Weaknesses
---------------------------------------------------------------------------
Weakness argument 1: The authors claim that they generate personalized reviews. However, it seems that they have tested their model only with the summaries that come with each original review as (user) input phrases. This raises the question of how "personalized" the generated reviews really are.

Weakness argument 2: Related to the point above: how can BLEU measure the quality of a "personalized" review when it compares the generated review with the original (i.e., user-neutral) review? A different evaluation metric seems necessary.
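To make this concrete, consider a toy example with NLTK's sentence-level BLEU (the sentences are invented for illustration):

    # BLEU scores the generated review only against the single original
    # reference, so genuinely personal phrasing that diverges from that
    # one reference is penalised even when the review is plausible.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "great tablet , the battery lasts all day".split()
    generated = "as a frequent traveller i love how long the battery lasts".split()

    smooth = SmoothingFunction().method1
    print(sentence_bleu([reference], generated, smoothing_function=smooth))
    # low score, although the generated sentence is a reasonable
    # personalised rendering of the same opinion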
Weakness argument 3: It is unclear from the example in Figure 1 what the user id "A3G8..." is supposed to be good for. Does it point to a user profile that contains useful information? If so, what kind of information is that?

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
NLP Tasks / Applications: Moderate contribution
Methods / Algorithms: Moderate contribution
Theoretical / Algorithmic Results: Moderate contribution
Empirical Results: N/A
Data / Resources: N/A
Software / Systems: Moderate contribution
Evaluation Methods / Metrics: N/A
Other Contributions: N/A
Originality (1-5): 3
Soundness/Correctness (1-5): 4
Substance (1-5): 4
Replicability (1-5): 3
Meaningful Comparison (1-5): 4
Readability (1-5): 4
Overall Score (1-6): 4