============================================================================
NAACL-HLT 2019 Reviews for Submission #829
============================================================================

Title: Learning to Attend On Essential Terms: An Enhanced Retriever-Reader
Model for Open-domain Question Answering

Authors: Jianmo Ni, Chenguang Zhu, Weizhu Chen and Julian McAuley

============================================================================
                                REVIEWER #1
============================================================================

What is this paper about, and what contributions does it make?
---------------------------------------------------------------------------
Problem/Question:

This paper investigates question answering on several challenging datasets,
including the ARC (Aristo Reasoning Challenge) science exam questions.

Contributions (list at least two):

The main contributions of the paper are that it:

- Proposes a model for extracting "essential terms", sometimes known as
  "focus words" or just "keywords", from questions in several different
  datasets. This model is learned, but achieves performance competitive with
  the current state-of-the-art essential term extractor, which is a large
  set of manually crafted heuristics.

- Demonstrates through a strong empirical evaluation that their ET-RR QA
  model outperforms other QA methods on these benchmark datasets, including
  ARC.

---------------------------------------------------------------------------
What strengths does this paper have?
---------------------------------------------------------------------------
Strengths (list at least two):

The paper addresses a challenging problem (question answering on the very
difficult "challenge" subset of the ARC science dataset) and demonstrates
state-of-the-art performance.

The empirical evaluation of this paper is strong -- all the natural
questions I had about where the performance came from were answered by many
follow-up experiments and ablation studies.

---------------------------------------------------------------------------
What weaknesses does this paper have?
---------------------------------------------------------------------------
Weaknesses (list at least two):

I have included a detailed list of comments and edits under "Additional
Comments" below, most minor and easily resolved. Here I'd like to echo one
major issue:

*Regarding Table 12* (the evaluation of the performance of the QA system
using "essential terms" versus tf.idf scores): I think this is one of the
most important tables in the paper, given the paper's central claim that the
proposed "essential terms" extracted from the question improve performance.
What it actually shows is that the proposed essential term extractor makes
only very small contributions to overall performance compared to just a
TF.IDF model, or to concatenating all the question and answer text together
(~36% vs. 35%). Some notes on this:

1) Demonstrating that a given method of keyword/focus word/essential term
extraction benefits challenging questions (other than TREC) is notoriously
challenging (and frustrating). Jansen et al. showed, I think, a ~2%
performance benefit over a TF.IDF baseline for their heuristic focus word
extractor on a multi-hop inference model for science QA. Khashabi et al.
showed, I think, a ~4% gain on an IR model, but struggled greatly to show a
benefit with any other more complicated inference model. The (perhaps
currently unwritten) consensus on this is that the keywords that are
important depend strongly on the method used for question answering -- you
can imagine (for example) an associative method doing better with one subset
of keywords, and an "inference" method doing better with another. Most of
the keyword extraction methods we as a field have tried to create extract
terms that we think would be useful for humans to solve the question, using
the cognitive inference methods at our disposal. Machines are very far from
having anything close to this inference capacity, and so while we might get
better at some of the requisite tasks of complex QA (like keyword
extraction), until our models become substantially more complex, it
empirically appears unlikely that we would see large benefits. I think
that's okay -- complex QA is a hard, multi-stage problem, and we have to
solve all the parts of the process one step at a time. But I think the paper
would be served by some discussion of (a) past performance benefits (Jansen,
Khashabi) from keywords on QA, (b) noting that this model does about as well
as Khashabi's system for extracting keywords, and (c) noting somewhat more
prominently, as a major part of the narrative, that the essential terms
improve performance on the proposed ET-RR model only very modestly (and this
improvement may not meet the threshold for statistical significance).

2) There absolutely must be statistics showing that the numbers in the
columns of Table 12 are significantly different, or this paper hasn't
demonstrated utility for the essential term extractor at the threshold
required for publication. (A minimal sketch of one such test is included
after this list.)

3) Even if the essential terms don't (significantly) improve model
performance, I don't think it's a showstopper (see point 1, above). I think
to move forward, the literature would be well served by an honest
characterization of the often negative result that is keyword extraction on
complex QA, rather than having this talked about in back alleys at
conferences.

4) Regardless of whether the 1% gain is significant or not, the effect size
is quite small, and the overall claims throughout the paper should be
reduced somewhat -- one would imagine that achieving an F1 of 0.80 on
keyword extraction would lead to transformative gains on the downstream QA
task, but here they're only ~+1%. It seems most of the performance benefit
comes from the QA model, and not the essential term extractor.

5) Regardless of the significance of the gain from the essential term
extractor, the QA model still makes a very strong and state-of-the-art
contribution, and I think the paper should be accepted pending reduced
claims about essential term extraction throughout (and potentially spinning
the narrative from (1) above: "we thought keywords would help a lot, but it
turns out that doing very well on the keyword extraction task helps very
little on the overall QA task").
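For concreteness, here is a minimal sketch of one such test -- a paired
bootstrap over per-question correctness vectors. This is illustrative code
only (not from the paper); the function and variable names are hypothetical:

    import numpy as np

    def paired_bootstrap(correct_a, correct_b, n_boot=10000, seed=0):
        """Paired bootstrap over per-question 0/1 correctness for two systems.

        correct_a, correct_b: equal-length arrays, 1 if that system answered
        the i-th test question correctly, 0 otherwise.
        Returns the observed accuracy gap (A - B) and an approximate
        two-sided p-value for the null hypothesis of no difference.
        """
        rng = np.random.default_rng(seed)
        a = np.asarray(correct_a, dtype=float)
        b = np.asarray(correct_b, dtype=float)
        n = len(a)
        observed = a.mean() - b.mean()
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)      # resample questions with replacement
            diffs[i] = a[idx].mean() - b[idx].mean()
        # two-sided p-value: fraction of resampled gaps on the "wrong" side of zero, doubled
        p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
        return observed, min(p_value, 1.0)

McNemar's test on the paired correct/incorrect counts would be an equally
standard choice for this kind of paired binary comparison.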
---------------------------------------------------------------------------
---------------------------------------------------------------------------
                             Reviewer's Scores
---------------------------------------------------------------------------
                       Overall Score (1-6): 5
                        Readability (1-5): 4


Additional Comments (Optional)
---------------------------------------------------------------------------
Notes:

- Introduction: The paper is about using filtered queries ("essential
  terms") and selectively attending to those terms, but the first page and a
  half (up to "To address these difficulties...") feels somewhat
  disorganized: it begins to talk about different datasets and issues with
  commonsense reasoning, but doesn't say, in a clear and concise way, that
  (a) there are many datasets for QA, (b) either [1] as shown by X, they
  suffer from problem Y, or [2] we hypothesize based on ABC that they suffer
  from problem Y, and (c) we address problem Y through method Z (essential
  terms/selective attention/reading strategies). The overview of the
  specific datasets (and Table 1) can happen in the next section for clarity
  -- in the introduction you can keep the reader on track by saying "To be
  thorough we evaluate our proposed method on X benchmark datasets" (a very
  strong empirical evaluation, and a great aspect of this paper), rather
  than periodically touching on somewhat hard-to-relate issues each dataset
  has without bringing it all together in a clear, concise way for the
  reader.

- Figure 1: This figure is trying to convey a lot of information: (A) the
  idea of an unfiltered query (Q+CA_n) vs. a filtered query (essential
  terms), and (B) the idea of passages having overlap with the question and
  answer terms. I think it has clarity issues that can be improved: (1)
  while some words are highlighted, not all of the essential terms are
  highlighted in the Q/CA; (2) some words are colored to help show
  connections -- this is good, but the fact that the essential terms aren't
  also highlighted makes it challenging for the reader to easily distill the
  difference between the two methods; and (3) the alignment of the various
  text elements mixes center/left/right justification, which makes the
  figure hard to visually parse. I would make the figure full-page-width,
  split it down the middle, put copies of the question at the top of both
  sides with labels ("unfiltered query", "essential-term filtered query") to
  show what's going on, and then proceed downward with the appropriate
  highlighting.

- Introduction: "ARC dataset, where our model achieves an accuracy of
  36.61%" should specify that this is specifically for ARC-Challenge. The
  easy set has seen models with much higher performance.

Related Work:

- "AI2 Reasoning Challenge (ARC) scientific QA dataset (Clark et al., 2018).
  This dataset contains elementary-level multiple-choice scientific
  questions from standardized tests" -> ARC has questions from 3rd to 9th
  grade (elementary through middle school).

- "The challenge set consists of questions that cannot be answered correctly
  by any of the solvers based on Pointwise Mutual Information (PMI) or
  Information Retrieval (IR)." If I remember correctly, the criterion is
  that they couldn't be answered by either IR or co-occurrence (e.g.
  embedding) methods.

- Missing reference/comparison: Before Khashabi (2017), Jansen et al.
  (Computational Linguistics 2017), "Framing Question Answering as Building
  and Ranking Answer Justifications", also built an essential term extractor
  (a "focus word extractor") and demonstrated its benefit for science
  question answering on a subset of the ARC set used in this paper. This
  appears to be *highly* related work that was missed.

- Boratko et al. (2018): I would strongly caution against drawing evidence
  or conclusions from this paper. Boratko et al.'s two sets of analyses have
  inter-annotator agreements of k=0.34 for analysis 1 (which they
  erroneously report as "good agreement"; it's "minimal") and k=-0.68 (!),
  or chance agreement, for analysis 2. McHugh (2012) would estimate only
  4-15% of the data in analysis 1 as reliable, and 0% in analysis 2. How
  this paper made it past reviewers is beyond me.

Approach:

- 3.1.5 Choice interaction -- this is really interesting. I'd be curious to
  see more description of this at a conceptual level, and of the
  hypothesized benefit.

- 3.2 Essential Term Selector -- this is also interesting. Very good to see
  the essential term extractor itself separately evaluated in 4.1.

*Regarding Table 12* (the evaluation of the essential terms versus tf.idf
scores): I think this is one of the most important tables in the paper,
given the paper's central claim that it makes use of "essential terms"
extracted from the question to improve performance. What it actually shows
is that the proposed essential term extractor makes only very small
contributions to overall performance compared to just a TF.IDF model, or to
concatenating all the question and answer text together (~36% vs. 35%). Some
notes on this:

1) Demonstrating that a given method of keyword/focus word/essential term
extraction benefits challenging questions (other than TREC) is notoriously
challenging. Jansen et al. showed, I think, a ~2% performance benefit over a
TF.IDF baseline for their focus word extractor on science QA. Khashabi et
al. showed, I think, a ~4% gain on an IR model, but struggled greatly to
show a benefit with any other more complicated inference model. The (perhaps
currently unwritten) consensus on this is that the keywords that are
important depend strongly on the method used for question answering -- you
can imagine (for example) an associative method doing better with one subset
of keywords, and an "inference" method doing better with another. Most of
the keyword extraction methods we as a field have tried to create extract
terms that we think would be useful for humans to solve the question, using
the cognitive inference methods at our disposal. Machines are very far from
having anything close to this inference capacity, and so while we might get
better at some of the requisite tasks of complex QA (like focus word
extraction), until our models become substantially more complex, it
empirically appears unlikely that we would see large benefits. I think
that's okay -- it's a complex, multi-stage problem, and we have to solve all
the parts one step at a time. But I think the paper would be served by some
discussion of (a) past performance benefits (Jansen, Khashabi) from keywords
on QA, and (b) noting that this model does about as well as Khashabi's
system for extracting keywords.

2) There absolutely must be statistics showing that the numbers in the
columns of Table 12 are significantly different, or this paper hasn't
demonstrated utility for the essential term extractor at the threshold
required for publication.
3) Even if it doesn't, I don't think it's a showstopper (see point 1,
above). I think the literature would be well served by an honest
characterization of the negative result that is keyword extraction on
complex QA.

4) Regardless of whether the 1% gain is significant or not, the overall
claims throughout the paper should perhaps be reduced somewhat -- one would
imagine that achieving an F1 of 0.80 on keyword extraction would lead to
transformative gains on the downstream QA task, but here they're only ~+1%.
It seems most of the performance benefit comes from the QA model, and not
the essential term extractor.

5) Regardless of the significance of the gain from the essential term
extractor, the QA model still makes a very strong and state-of-the-art
contribution, and I think the paper should be accepted pending reduced
claims about essential term extraction throughout (and potentially spinning
the narrative from (1) above: "we thought keywords would help a lot, but it
turns out that doing very well on the keyword extraction task helps very
little on the overall QA task").

Experiments:

- Strong performance across multiple datasets, and strong, meaningful
  comparison against contemporary models.

- Good ablation study, and a good study of query formulation methods to see
  whether this attention method is doing better than natural baselines (like
  tf.idf). (A toy sketch of the query-formulation variants being compared
  appears at the end of these comments.)

- Table 11: it seems like the inter-choice mechanism from Section 3.1.5
  isn't really contributing to performance. I think this is an interesting
  negative result, and it should be kept in the paper -- but the description
  of it in the text should be shored up. "It is worth noting that the choice
  interaction layer gives a further 0.24% boost on test accuracy" -- it is
  strongly unlikely that this is a statistically significant difference. I
  think it would be good to see a discussion of why this method of comparing
  the multiple-choice answer candidates *isn't* helping performance, as one
  might expect it would -- perhaps there's less extra information in
  forced-choice tasks than one might expect, or it's harder to extract than
  one might at first think.

- General experiment comment: there don't appear to be any inferential
  statistical comparisons used in these experiments. I don't doubt that most
  of these differences are significant (particularly the main effects), but
  this really has to be shown. Inferential statistics are a key part of
  doing science, and NLP is the only field I'm aware of that lets people get
  away with poor or zero statistics, which only hurts us all.

Extra page:

- It would be helpful to have an in-depth error analysis, both of the
  essential term extractor and of the overall QA model, to better understand
  their performance and limitations.

Open Source / Data:

- Reproducibility is critical, and there isn't any mention of releasing the
  source code or the annotations (e.g. essential terms) generated here. It
  is critical that these products be released, and a URL for downloading
  them should be included with the final submission.

References:

- The reference list is not in the standard format. E.g. the first
  reference, "In QA@ACL", should list "In the 1st annual workshop on
  question answering..." etc.
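For concreteness, a toy sketch of the query-formulation variants being
compared (full question+answer concatenation, TF-IDF-weighted top terms, and
learned essential terms). This is illustrative only -- the
essential_term_scores argument is a hypothetical stand-in for whatever
scorer the model provides, and none of this is the authors' code:

    import math
    from collections import Counter

    def tfidf_top_terms(question, corpus_df, n_docs, k=10):
        """Rank question terms by a simple tf-idf score against a background corpus.

        corpus_df: dict mapping term -> number of corpus documents containing it
        n_docs:    total number of documents in the background corpus
        """
        tf = Counter(question.lower().split())
        scores = {t: tf[t] * math.log(n_docs / (1 + corpus_df.get(t, 0))) for t in tf}
        return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

    def build_queries(question, choice, corpus_df, n_docs,
                      essential_term_scores=None, k=10):
        """Build three retrieval-query variants of the kind compared in Table 12."""
        full_concat = question + " " + choice                   # concatenate everything
        tfidf_query = " ".join(tfidf_top_terms(question, corpus_df, n_docs, k) + [choice])
        et_query = None
        if essential_term_scores is not None:                   # hypothetical: term -> score
            top_terms = sorted(essential_term_scores,
                               key=essential_term_scores.get, reverse=True)[:k]
            et_query = " ".join(top_terms + [choice])
        return full_concat, tfidf_query, et_query

Each variant would then be sent to the same retrieval index, so that any
accuracy difference can be attributed to the query formulation alone.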
---------------------------------------------------------------------------

============================================================================
                                REVIEWER #2
============================================================================

What is this paper about, and what contributions does it make?
---------------------------------------------------------------------------
Problem/Question:

This paper looks at answering open-domain multiple-choice questions with a
standard retrieve + read pipeline. For retrieval, essential terms from the
question are selected and then concatenated with each answer choice to
collect the relevant passages. Then a reader module assigns a score to each
answer choice based on the retrieved evidence. The main hypothesis is that
essential term selection can help retrieve better evidence.

Contributions (list at least two):

1. The notion of using essential terms to reformulate the query before
   retrieval is a novel one, and it is shown to be quite effective in the
   current paper.

2. A new neural model for essential term selection is presented, which is
   comparable to the previous best from Khashabi et al. (CoNLL 2017). The
   reader model presented is also quite effective, leading to strong
   performance on the RACE dataset.

3. Combined, the various components lead to a new state of the art on the
   ARC dataset, and strong performance on 3 other datasets modified for this
   paper's setup.

---------------------------------------------------------------------------
What strengths does this paper have?
---------------------------------------------------------------------------
Strengths (list at least two):

1. The main strength of this paper is its empirical results. The essential
   term selector is clearly useful based on its superior performance to the
   ET-RR (concat) baseline across all datasets. Table 12 also shows that it
   is more effective than a TF-IDF based approach.

2. The paper is clearly written and easy to follow. The introduction does a
   good job of motivating the problem and the proposed solution, and the
   models are presented clearly.

---------------------------------------------------------------------------
What weaknesses does this paper have?
---------------------------------------------------------------------------
Weaknesses (list at least two):

1. RACE and MCScript are strange choices to test the proposed pipeline. My
   understanding is that many questions in these datasets only make sense
   within the context of the given passage (e.g. "what was used to dig the
   hole?"). Why not test on real open-domain QA datasets such as TriviaQA or
   Quasar-T? That would also allow comparing to the stronger baselines
   developed in those works.

2. While the paper is clearly written, many experimental details are
   missing. There is no description of many baselines in Tables 3, 7 and 9.
   For Amazon-QA, are only reviews for a particular product concatenated to
   form the corpus, or all reviews across the dataset? Does the Moqa
   baseline also select among only two candidate answers? It would also be
   nice to see some qualitative examples of which questions the model gets
   right / wrong.

3. It is not clear whether the various model components are all needed for
   the high performance, e.g. the POS and NE embeddings, and the ConceptNet
   relations.

---------------------------------------------------------------------------
---------------------------------------------------------------------------
                             Reviewer's Scores
---------------------------------------------------------------------------
                       Overall Score (1-6): 4
                        Readability (1-5): 5


Additional Comments (Optional)
---------------------------------------------------------------------------
1. What is meant by "indirectly related evidence" in the abstract? How does
   extracting essential terms help retrieve such evidence?
2. Line 48: it is strange to say multiple-choice QA is harder than
   span-based QA, since the output space is typically much larger for the
   latter.

3. Line 292: should be {w_t^P}_{t=1}^K.

4. Line 389: should be c \times d_w.

5. Line 414 says "bilinear sequence matching", but Eq. 4 only has a linear
   operation (see the short illustration after this list).

6. Table 8: why not compare to ET-RR pre-trained on RACE as well?
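To spell out the distinction drawn in point 5 -- a generic illustration, not
a reproduction of the paper's Eq. 4, with u and v standing in for the two
sequence representations:

    s_{\text{linear}}   = \mathbf{w}^{\top} [\mathbf{u} ; \mathbf{v}] + b
    s_{\text{bilinear}} = \mathbf{u}^{\top} \mathbf{W} \mathbf{v}

The bilinear form scores pairwise interactions between the two
representations through the matrix W; a single linear projection of the
concatenated representations contains no such interaction term.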
---------------------------------------------------------------------------

============================================================================
                                REVIEWER #3
============================================================================

What is this paper about, and what contributions does it make?
---------------------------------------------------------------------------
Problem/Question:

The paper addresses the problem of finding relevant evidence (paragraphs) in
the setting of open-domain question answering where the question has several
answer choices (multiple-choice questions, MCQ). They propose to reformulate
the query by first retrieving essential terms from the query and then
concatenating them with the answer choices. Empirically they show that this
leads to better overall performance on the question answering task on three
MCQ datasets. They propose separate retriever and reader models. The
retriever model (for identifying essential terms) is separately trained on a
different dataset (which has annotations for which question terms are
important). Once trained, the retriever model is used in training the reader
model.

Contributions (list at least two):

* The paper proposes a retriever-reader architecture for MCQ open-domain
  datasets.
* They demonstrate that retrieval is an important process in the open-domain
  setting by showing that informed query reformulation leads to an increase
  in performance.

---------------------------------------------------------------------------
What strengths does this paper have?
---------------------------------------------------------------------------
Strengths (list at least two):

* The paper was clearly written for the most part and was easy to follow.
* Query reformulation is an interesting idea and it was nice to see it
  working.
* The fact that the essential term retriever model works even though it is
  trained on a different dataset is interesting.
* The paper gets meaningful gains on 3 datasets.

---------------------------------------------------------------------------
What weaknesses does this paper have?
---------------------------------------------------------------------------
Weaknesses (list at least two):

* The query reformulation step of the paper heavily depends on the presence
  of answer choices, which restricts it to answering only MCQ-type
  questions. I don't dispute that solving MCQ questions is important, but
  the proposed approach will only work for MCQ-type questions, and I think
  the authors should make that clear at the beginning of the paper.
* The essential-term retriever model depends on another annotated dataset.
  I think an end-to-end model (which marginalizes over this objective) would
  be a more generic way to solve this problem.
* I don't understand what the main contribution of the reader model is.
  Will this query reformulation approach not work with any reader model? If
  that is the case, I would recommend cutting the reader section short (it
  now takes 2 pages). For the same reason, I would really like to see an
  experiment where the query reformulation approach is combined with another
  baseline reader such as BiDAF. This would clearly show that query
  reformulation is useful.
* User queries are usually short, so I have a feeling that 10 terms for each
  question is high. It would be helpful to see what proportion of the
  original query length this represents.
* Missing citations: the related work section needs work. Query
  reformulation, although interesting, is not a new idea. You definitely
  need to cite some very relevant IR papers, for example Xu and Croft '03
  for query expansion (SIGIR, test of time). You have also missed citing
  some related work: for example, Das et al., ICLR 2019
  (https://openreview.net/forum?id=HkfPSh05K7) propose a query reformulation
  method in vector space which does not depend on answer candidates. I am
  not penalizing for the latter work since it is concurrent work, but a
  discussion and comparison is required.

---------------------------------------------------------------------------
---------------------------------------------------------------------------
                             Reviewer's Scores
---------------------------------------------------------------------------
                       Overall Score (1-6): 4
                        Readability (1-5): 3

===========================================================================
ICLR 2019 (withdrawn during review period)
===========================================================================
https://openreview.net/forum?id=BJlif3C5FQ