============================================================================
ACL 2019 Reviews for Submission #911
============================================================================

Title: Fine-Grained Spoiler Detection from Large-Scale Review Corpora
Authors: Mengting Wan, Rishabh Misra, Ndapa Nakashole and Julian McAuley

============================================================================
META-REVIEW
============================================================================

Comments: This paper proposes a large dataset for detecting spoiler sentences in book reviews and experiments on it with a deep neural network. The proposed method is well motivated and is compared against many baseline approaches. Although the network itself is not a major contribution, the proposed dataset exposes the community to an important NLP research question that is useful in many settings.

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper collects a dataset of book reviews from goodreads.com annotated with sentence-level spoiler tags by the review writers, and proposes a neural architecture to classify whether sentences are likely to contain spoilers. This is generally a well-written and interesting paper that makes good contributions, but it has several substantial issues that I hope the authors will correct before the conference if it is accepted.

I think the dataset is the strongest contribution of this paper: it is an order of magnitude larger than previously released datasets, will certainly help drive work in this area, and is useful even independent of the spoiler tags. The empirical findings regarding the distribution of the data (e.g., Figure 2) are also inherently interesting. The neural model the authors propose is well motivated and clean and achieves good results. Will you release the code for your model?

So, to be direct: I generally liked this paper a lot! But I'd like to address what I see as a few substantial areas for improvement that I hope the authors will address in the final version.

Firstly, I'm confused as to why the authors used a unigram rather than a bigram or trigram SVM as a baseline; this is simple to implement and would provide a more reasonable comparison with methods like CNNs and biLSTMs.

A substantial portion of the paper is dedicated to the model, which is essentially a HAN with a few bells and whistles that intuitively should help for this task. This is fine, but I think several portions are under-explained. It's not immediately clear how the learnable item and user biases are implemented; the one sentence explaining them reads:

"Here we use learnable scalars b_i, b_u to model the item and user biases which can not be explained by the language model."

... and I'm just not sure exactly what was done. Similarly, for SVM-BOW, does "averaged based on" mean a weighted average where the weights are tf-idf scores? If so, just say that.

Another issue is that the authors (as far as I can tell) only test the model in a setting where some items occur in both the training and test sets. How well does the model work under a stratified split in which items appear exclusively in either training or test?
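For concreteness, here is a minimal sketch of the kind of item-disjoint split I have in mind (this is not the authors' code; the file path and column names are hypothetical), using scikit-learn's GroupShuffleSplit:

    # Minimal sketch of an item-disjoint train/test split.
    # The file path and the "book_id" column are hypothetical placeholders.
    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    df = pd.read_json("goodreads_review_sentences.json", lines=True)

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(df, groups=df["book_id"]))
    train, test = df.iloc[train_idx], df.iloc[test_idx]

    # No reviewed item appears on both sides of the split.
    assert set(train["book_id"]).isdisjoint(set(test["book_id"]))

Reporting results under such a split would show how much of the model's advantage comes from memorizing item-specific vocabulary rather than from generalizable spoiler cues.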
Was any statistical significance testing done to confirm that the model actually performs better? Combined with the points above, I am not yet willing to accept that this model is actually stronger than, e.g., a trigram SVM.

Furthermore, the paper has essentially no error analysis, and I think it really needs it. What does the proposed model architecture get us that a trigram SVM does not? What is the model actually learning linguistically? The issue of item specificity is interesting, but what is actually being picked up on by the model? Error analysis would help a reader understand what we really learned here. (Edit: after writing the above I looked at the supplementary material and realized it includes some error analysis. This could be made more systematic and more substantial with respect to comparisons against baselines, and it should definitely be in the body of the paper. My perception above is very likely the perception a general reader would have without seeing the supplementary material.)

The task could be better motivated, and indeed potentially pitched as a very interesting AI/linguistic task. Presumably, doing this perfectly would require a high degree of common-sense reasoning and linguistic understanding. I assume that the model as it stands is not really doing this and is instead picking up on patterns in the easier examples, but again, more error analysis would help a lot here.

Good work, thanks for the paper!
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
- Release of a useful dataset
- Proposal of well-motivated bells and whistles on a HAN
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
- Insufficient evidence that the model is novel / performing well
- Lack of error analysis, so it's hard to know what we've learned from this paper
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 4

Questions and Suggestions for the Author(s)
---------------------------------------------------------------------------
Beyond the substantive issues stated as weaknesses above, there are a few clarity issues that could use some attention. Please make the definition of "item" more prominent, perhaps by restating it in the "Item-specificity" section. The definition of "item" appears in the first paragraph of the related work section, but to my mind it is overloaded with "document"; when I was first reading, I had to stop and carefully re-read to realize that the "item" is the work being reviewed and the "document" is the review. The caption for Figure 2 is insufficiently clear as well; for instance, I simply don't know what's going on in Figure 2e, and the caption does not help me understand.
---------------------------------------------------------------------------

============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, what are the main strengths and weaknesses?
---------------------------------------------------------------------------
The paper presents an algorithm and a large dataset for spoiler sentence detection within book reviews. The authors collected a set of 1.4M reviews from Goodreads consisting of 17.7M sentences, of which 3.2% are tagged by the crowd as "spoiler". The algorithm they present is an improvement over the HAN model that incorporates additional features about the sentence, such as item-specific features and user bias. This set of features improves the results over the HAN baseline (and other, weaker baselines) when tested on the Goodreads dataset and an older dataset from TV Tropes. AUC was used as the evaluation measure. The supplementary material includes some useful additional information on the dataset and the annotation method.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
- A large, realistic dataset is collected and offered to the public, which will create an excellent benchmark for future work in this direction.
- The newly proposed features are engineered based on an analysis of the data and were shown to be useful.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
- The methodological contribution is not large: just a few simple features are added to a known architecture.
- The improvement of the model over the HAN baseline is marginal.
- Only AUC was used; F-measure would be very useful to include as well.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 3.5

Questions and Suggestions for the Author(s)
---------------------------------------------------------------------------
The work is interesting. The largest contribution of this work is the dataset and the proposed method for collecting the data from Goodreads, which can be replicated for other languages and at large scale. Having the dataset released would make it a nice benchmark. However, the methodological contribution itself is not significant (which might be OK for a short paper). What could be done to improve the paper is to explore more algorithmic solutions that can further improve the model, and please also report F-measure for evaluation.
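For example, both metrics can be reported from the same predicted scores; a minimal sketch with toy labels (not the paper's evaluation code) might look like:

    # Toy sketch: report both AUC and F1 from the same predicted spoiler scores.
    # The labels and scores below are invented purely for illustration.
    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    y_true = np.array([0, 0, 1, 1, 0, 1])               # gold spoiler labels
    y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])   # model scores

    print("AUC:", roc_auc_score(y_true, y_prob))
    print("F1: ", f1_score(y_true, (y_prob >= 0.5).astype(int)))  # 0.5 threshold

Since F-measure requires choosing a decision threshold, the chosen threshold (or a precision-recall curve) should be reported alongside it.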
---------------------------------------------------------------------------

Missing References
---------------------------------------------------------------------------
NA
---------------------------------------------------------------------------

Typos, Grammar, and Style
---------------------------------------------------------------------------
NA
---------------------------------------------------------------------------

============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, what are the main strengths and weaknesses?
---------------------------------------------------------------------------
This paper focuses on the problem of spoiler detection. The authors present (and plan to release) a large, novel corpus annotated for spoiler information at the sentence level. The authors also introduce a new hierarchical attention network that incorporates bias terms for users and items (i.e., review topics) as well as a set of specialized word-level, item-specific features. To evaluate their model, the authors compare its test-set performance to the performance of a variety of baseline models, including SVMs, a CNN, and a standard HAN.

Overall, I thought this paper was well written and, importantly, the model the authors introduce was clearly defined. I also very much appreciated the comparisons against a wide range of baselines. And, of course, the contribution of this dataset is notable, too. My only minor complaint is that the current state-of-the-art model for the TV Tropes data (apparently; I am not familiar with this particular domain) relies on a genre encoder, and it would have been nice if the authors had incorporated this into their model.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
Overall, I thought this paper was well written and, importantly, the model the authors introduce was clearly defined and interesting. I also very much appreciated the comparisons against a wide range of baselines. And, of course, the contribution of this dataset is notable, too.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Overall Recommendation: 4