============================================================================
EMNLP 2020 Reviews for Submission #1797
============================================================================

Title: Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions

Authors: Bodhisattwa Prasad Majumder, Harsh Jhamtani, Taylor Berg-Kirkpatrick and Julian McAuley

============================================================================
META-REVIEW
============================================================================

Comments: This paper proposes a commonsense extension to persona-grounded dialog modeling. The authors use COMET to incorporate commonsense knowledge into personalized dialog generation, with good results in diversity and quality. Though the technical novelty is limited, it is a nice extension to existing work in that it combines both persona and knowledge. The experimental results could be stronger with an error analysis, a study of how the number of persona sentences influences generation quality, and evaluation on other datasets (for instance, the Weibo data from "Personalized Dialogue Generation with Diversified Traits").

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

To infer the implications of given persona descriptions, this paper proposes persona-grounded dialog with commonsense expansions.

The main strengths:
1. The authors use existing commonsense knowledge bases and paraphrasing resources to give dialog models access to an expanded and richer set of persona descriptions.
2. The work explores two automatic ways to generate the expansions, COMET and paraphrasing, rather than manual rewriting.
3. Their model outperforms baselines in terms of dialog quality and diversity.

The main weaknesses:
1. The writing can be improved.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

Their model outperforms baselines in terms of dialog quality and diversity.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

None.

---------------------------------------------------------------------------
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Overall Recommendation: 4

Questions for the Author(s)
---------------------------------------------------------------------------

No questions.

---------------------------------------------------------------------------
Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------

‘6 Related Works’ -> ‘6 Related Work’

---------------------------------------------------------------------------
============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

This paper is about personalized dialogue for open-domain chit-chat bots, i.e., giving a bot a predefined personality (more specifically, a set of persona sentences describing some facts). The goal is a chatbot that is consistent with itself while at the same time maximizing response diversity. The authors propose two ways to help their model reason over persona sentences: 1) commonsense knowledge expansion using COMET, and 2) sentence paraphrasing using a back-translation model. The challenge then becomes how to select relevant information from this large pool of text for response generation. They use a discrete latent variable z to select a persona sentence, where the posterior P(z|x, H, C) is used to train the prior P(z|H, C) via REINFORCE (a schematic sketch of this setup is included at the end of this review). The generator is additionally trained with supervision from x. The results shown in the paper are quite promising and persuasive, especially the human evaluation. I am not really surprised that some metrics are even better than the "gold" responses, because I know what the PersonaChat dataset looks like: some of the original responses are not very engaging or relevant due to the way the data were collected.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

* It is a good idea to combine COMET and sentence paraphrasing to improve the diversity of open-domain chit-chat models. I would be more than happy to see follow-up work in this direction.
* Although some details might be missing, in general the paper is well written and easy to follow.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

* I did not see a big problem with this work.
* To be picky, I would expect some error analysis and some analysis of how the number of expanded knowledge sentences influences the overall performance.

---------------------------------------------------------------------------
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Overall Recommendation: 4

Questions for the Author(s)
---------------------------------------------------------------------------

* I may have missed it, but where do you define your reward function? Is it based on output perplexity or BLEU?
* Instead of a discrete representation, have you tried a continuous setting? You do not really need to feed in the words from those expansions, only their encoded representations.
* Do you have any post-processing to control the quality of the paraphrases?
* I am not entirely sure, but I assume the discrete variable you use is effectively a one-hot representation, since you only pick a single persona sentence C_k. Why is this? Have you considered a threshold-based method that allows a dynamic number of persona sentences as input?

---------------------------------------------------------------------------
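
For concreteness, here is a minimal sketch of the kind of REINFORCE-trained discrete persona-choice prior described above. It is not the authors' code: the PyTorch modules, shapes, variable names, and the use of the generator's log-likelihood of the gold response x as the reward are all assumptions made only for illustration.

# Minimal sketch (not the authors' implementation) of training a prior
# p(z | H, C) over K persona sentences with REINFORCE. The reward for a
# sampled choice z is assumed to be the generator's log-likelihood of the
# gold response x given that persona sentence; all shapes are toy values.
import torch
import torch.nn as nn

K, D = 5, 64                              # persona pool size, hidden size
prior_net = nn.Linear(2 * D, K)           # scores z from encodings of H and C
optimizer = torch.optim.Adam(prior_net.parameters(), lr=1e-4)

def reinforce_step(h_enc, c_enc, reward_per_z):
    # h_enc, c_enc: (D,) encodings of dialog history H and persona set C
    # reward_per_z: (K,) reward for conditioning on each persona sentence,
    #               e.g. log P_generator(x | z) for each candidate z
    logits = prior_net(torch.cat([h_enc, c_enc]))      # (K,) prior scores
    dist = torch.distributions.Categorical(logits=logits)
    z = dist.sample()                                   # discrete persona choice
    advantage = reward_per_z[z] - reward_per_z.mean()   # mean baseline for variance reduction
    loss = -advantage.detach() * dist.log_prob(z)       # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return z.item(), loss.item()

# toy call with random encodings and rewards
z_choice, step_loss = reinforce_step(torch.randn(D), torch.randn(D), torch.randn(K))
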
============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

*Summary*

1. This work proposes expanding persona sentences using commonsense knowledge bases and paraphrasing techniques. It also proposes the COMPAC model, built on a pretrained GPT-2 with RoBERTa embeddings, for fine-grained choice of a persona sentence for response generation. The proposed approach is evaluated on the PersonaChat dataset and outperforms three competitive baselines in both automatic and human evaluation.

*Positives*

1. It is an interesting and practical idea to expand the persona sentences, because there are often too few of them (3-5 short sentences) per character. To implement this idea, the work uses two existing techniques: the commonsense knowledge model COMET and an off-the-shelf paraphrase network based on back-translation (a rough back-translation sketch is appended at the end of this document).
2. This work proposes a new model named COMPAC, consisting of a prior network and a generator network. Its main focus is to enable fine-grained choice of a persona sentence out of the augmented pool to generate a target utterance.
3. Experimental evaluations on the PersonaChat dataset are thorough. Interestingly, COMPAC-original is better than other SOTA models, indicating that the proposed model may perform better even without persona augmentation. As expected, more persona sentences further improve performance. Other analyses, such as fine-grained persona grounding and controllable generation, are also intriguing.

*Negatives*

1. The technical contribution is rather incremental. More justification may be required.
(1) Although it is an interesting idea to populate persona sentences, this work relies on existing techniques to implement it: off-the-shelf commonsense and paraphrase generation.
(2) It is also hard to find significant technical novelty in COMPAC beyond the combination of existing techniques such as GPT-2 and RoBERTa embeddings. The Bayesian formulation used is also rather standard.

2. There is much room for improvement in the experimental evaluation.
(1) This work evaluates on only a single dataset, PersonaChat. Admittedly, it is the most important one for persona dialogue research, but other datasets may need to be adopted to validate the generality of the proposed method.
- The Weibo data (Personalized Dialogue Generation with Diversified Traits) could be a good candidate, or other types of conditional dialogue datasets may be used.
(2) More SOTA baselines need to be compared for better evaluation, including the following papers or even Blender.
- A. See et al. What Makes a Good Conversation? How Controllable Attributes Affect Human Judgments. NAACL-HLT 2019.
- S. Roller et al. Recipes for Building an Open-Domain Chatbot. arXiv 2020.
(3) It seems that LIC+KS could be a better baseline than GPT2 for human evaluation, because LIC+KS was the winner of the human evaluation in the ConvAI2 competition. However, as shown in Table 4, this work uses only GPT2-based variants for comparison.

*Conclusion*

My initial rating is borderline, in that both the positives and negatives are nontrivial. I will make my decision after the discussion of the authors' rebuttal.

*Post-rebuttal review*

The authors' response partly clarifies the novelty issue and adds the suggested experimental evaluations (e.g., another SOTA model from A. See et al. 2019 and a human evaluation with LIC+KS). Therefore, I'd like to raise my score from 3.5 to 4.
---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

Please see the positives above.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

Please see the negatives above.

---------------------------------------------------------------------------
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Overall Recommendation: 4
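
As a companion to the persona-expansion discussion above (see Reviewer #3's first positive), here is a minimal sketch of a back-translation paraphraser of the kind the reviews describe. The specific MarianMT checkpoints and the German pivot language are assumptions for illustration only; the paper's actual paraphrase system may differ.

# Minimal back-translation paraphrasing sketch using public MarianMT
# checkpoints (an assumption; the paper's paraphrase model may differ).
# Each persona sentence is translated English -> German -> English to
# obtain a paraphrased variant for the expanded persona pool.
from transformers import MarianMTModel, MarianTokenizer

FWD, BWD = "Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-de-en"
fwd_tok, fwd = MarianTokenizer.from_pretrained(FWD), MarianMTModel.from_pretrained(FWD)
bwd_tok, bwd = MarianTokenizer.from_pretrained(BWD), MarianMTModel.from_pretrained(BWD)

def back_translate(sentences):
    # English -> pivot language
    pivot_ids = fwd.generate(**fwd_tok(sentences, return_tensors="pt", padding=True))
    pivot_txt = fwd_tok.batch_decode(pivot_ids, skip_special_tokens=True)
    # pivot language -> English, yielding paraphrases of the inputs
    back_ids = bwd.generate(**bwd_tok(pivot_txt, return_tensors="pt", padding=True))
    return bwd_tok.batch_decode(back_ids, skip_special_tokens=True)

print(back_translate(["I like hiking in the mountains.", "I have two dogs."]))
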