============================================================================
EMNLP-IJCNLP 2019 Reviews for Submission #1674
============================================================================

Title: Improving Neural Story Generation by Targeted Common Sense Grounding

Authors: Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley and Garrison Cottrell

============================================================================
META-REVIEW
============================================================================

Comments: All the reviews are positive and agree that the paper presents a good solution to an interesting problem. The evaluation is based on automatic methods only, but it is still quite detailed for a short paper. Overall, the paper is well written.

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper proposes a simple multi-task learning scheme that combines language modeling and perplexity ranking to improve performance on common sense reasoning (CSR) tasks. The proposed method, which fine-tunes a pre-trained GPT2 model on several datasets including SWAG, achieved better perplexity than Fan et al.'s Fusion model on the prompt ranking task.

[strengths]
- The proposed method also achieved high performance on two evaluations using the SWAG and Story Cloze datasets.

[weaknesses]
- According to Table 3, the performance on SWAG and Story Cloze improved mainly because the SWAG dataset was used for fine-tuning. To demonstrate the usefulness of the two-stage pipeline in multi-task learning, the results of "GPT2->SWAG" are needed, but such results were not presented in Table 3.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
The authors showed that the two-stage fine-tuning has a positive effect on the prompt ranking task.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
To my understanding, the main goal of this paper is to investigate the effectiveness of the proposed multi-task learning scheme on CSR tasks. However, Table 3 lacks the results of a baseline method that uses GPT2 but does not employ multi-task learning. Thus, it is difficult to judge whether the proposed multi-task learning is useful for CSR tasks.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 3.5

============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
In an attempt to improve perplexity, this paper grounds neural story generation in common sense. This is done in a multi-task setup. However, rather than adding an extra classifier to learn CSR, the paper uses perplexity ranking.

The WritingPrompts dataset is adapted to fine-tune a language model (as part of the multi-task training setup), which can then be used to generate a story given a prompt or to generate its own prompt. The alternate datasets for the multi-task setup are SWAG and a synthetic dataset that favours human-written text (the latter two represent the common sense knowledge).

Strengths:
- This is a simple way to bring common sense knowledge into the model.

Weaknesses: -
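As background for the perplexity ranking mentioned above, here is a minimal sketch of the general technique: each candidate ending of a multiple-choice CSR instance (e.g., SWAG or Story Cloze) is scored by the language model's negative log-likelihood given the context, and the lowest-scoring candidate is selected. This is an illustrative reconstruction, not the authors' code; in particular, the length normalization (mean rather than sum of token losses) is an assumption.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Illustrative sketch of perplexity ranking; not the paper's implementation.
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def ending_nll(context, ending):
        """Average negative log-likelihood of `ending` given `context`."""
        ctx_ids = tokenizer.encode(context)
        end_ids = tokenizer.encode(" " + ending)
        input_ids = torch.tensor([ctx_ids + end_ids])
        with torch.no_grad():
            logits = model(input_ids).logits
        # Logits at position t predict token t+1, so shift targets by one.
        targets = input_ids[:, 1:]
        log_probs = logits[:, :-1].log_softmax(dim=-1)
        token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Score only the ending tokens, not the context tokens.
        return -token_ll[0, len(ctx_ids) - 1:].mean().item()

    def rank_endings(context, endings):
        """Return the index of the ending the LM finds least perplexing."""
        return min(range(len(endings)), key=lambda i: ending_nll(context, endings[i]))

On SWAG or Story Cloze, rank_endings would be called with the given context and its candidate endings; no task-specific classification head is needed, which is the point the summary above highlights.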
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
A step forward in including common sense.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
The results are based on perplexity alone.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 4

============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper looks at leveraging common sense information (via SWAG and a webtext/GPT-generated dataset) to improve story generation. The paper is well written and motivated, and it is evaluated across several datasets. Results are strong over the baseline provided, and examples are provided in the appendix. Though the work makes some assumptions and has some shortcomings, these are clearly identified and addressed in the paper. The paper doesn't mention human evaluations at any point, which is a little surprising given the domain, but it is otherwise thoroughly evaluated.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
This paper addresses a topic of broad interest to the generation community (improving the commonsense aspects of generated text) and combines state-of-the-art architectures, datasets, and techniques in a clever way to achieve impressive results. The results demonstrate that there is plenty of room and possibility to improve over GPT2 baselines.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
None.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 4.5

Questions for the Author(s)
---------------------------------------------------------------------------
- I found the description of the prompt ranking task slightly confusing. When you say you find a "random sample" to be correct, is this referring to the randomly sampled prompt?
- While perplexity and commonsense are interesting and compelling evaluations, did you consider running any human evaluations, given the themes of commonsense and stories? Is it possible to meaningfully compare texts this long?
It is impressive that the models can generate such long stories, but this also complicates traditional approaches to evaluating generated text (e.g., giving two stories to a Turker and asking them to pick their favorite).
---------------------------------------------------------------------------

Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
- In Equation 1, should x_{1:t} be x_{1:t-1}?
---------------------------------------------------------------------------
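For context on that last point, the standard left-to-right factorization that Equation 1 presumably intends is (a reconstruction of the usual convention, not the paper's exact notation):

    P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{1:t-1})

That is, each token x_t is conditioned only on the tokens that precede it, which is why the conditioning term should read x_{1:t-1} rather than x_{1:t}.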