============================================================================
EMNLP 2020 Reviews for Submission #1640
============================================================================

Title: Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays

Authors: Jianmo Ni, Chun-Nan Hsu, Amilcare Gentili and Julian McAuley

============================================================================
META-REVIEW
============================================================================

Comments: The authors focus on report generation from chest X-rays. They propose to identify the abnormal parts of the chest X-rays for report generation, which is novel, and they achieve promising results on public datasets. This is an interesting contribution to the biomedical domain.

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

This paper focuses on automatically generating reports for medical images. Due to the data imbalance issue, previous works tend to generate normal findings, as these dominate the dataset. However, people are more concerned about abnormal findings, so this work proposes to focus on them. Instead of using an encoder-decoder generation model, this paper adopts a retrieval-based method that measures the similarity between images and abnormal findings.

Reasons to accept
---------------------------------------------------------------------------

This paper points out that data imbalance is one of the problems in medical report generation.
I appreciate the motivation of this paper, and it may bring insights for future research.

Reasons to reject
---------------------------------------------------------------------------

My major concern is that the improvement from the proposed retrieval-based method is very marginal. As shown in Table 1, for example, the improvement in accuracy is only 1%. Although CVSE outperforms the baselines on language generation metrics on the abnormal dataset, the paper does not demonstrate that CVSE is also the best on the complete dataset. I may be missing something here, but I would expect more comparisons of CVSE against other methods on both the complete and abnormal datasets, or perhaps on other medical image datasets. Otherwise, it is hard to convince people that the proposed method is very effective.

Post rebuttal: The rebuttal has completely addressed my main concern, so I will raise my score.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Overall Recommendation: 3.5

============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

The authors present their approach to automatically generate reports analyzing chest X-rays. They use robust self-supervised approaches and re-formulate the task as a cross-modal retrieval task.
Reasons to accept
---------------------------------------------------------------------------

The results are quite encouraging, and the authors describe their work in enough detail to motivate the community to think along with them in this field of research, which is definitely laudable.

Reasons to reject
---------------------------------------------------------------------------

The task is hard, and the metrics are, in absolute terms, low. An evaluation of what to expect (e.g., by stating annotator agreement, or by reporting something like a Likert score) would help position the work with respect to what can be expected from the numbers.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Overall Recommendation: 4

============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

In this paper, the authors discuss how to generate reports from chest X-rays based on abnormal findings, in order to avoid generating repetitive sentences.

Main strengths: This paper proposes a novel method to avoid the common issues inherent in text generation, and the method is easy to follow and extend.

Main weaknesses: The generated reports consist of abnormal findings drawn from historical data, so it is hard to generate appropriate reports for patients with rare diseases.
Reasons to accept
---------------------------------------------------------------------------

This paper proposes a text generation method based on specific sentences in historical data; this is a possible way to avoid common issues in text generation models.

Reasons to reject
---------------------------------------------------------------------------

It is hard for the method to generate appropriate reports for patients with rare diseases, which is very important in practical applications.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 2
Overall Recommendation: 3.5

Questions for the Author(s)
---------------------------------------------------------------------------

The method in this paper generates reports from abnormal findings in historical data. In real scenarios, however, it is very important to diagnose patients with rare diseases, and it is hard to find a dataset with enough samples for such patients. Can this method generate appropriate reports for these patients? Furthermore, one symptom may be described in several ways, and there are many symptoms in the medical field, so it is hard to guarantee that the dataset is large enough. How can the dependency on the dataset be reduced?
---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------

n/a

---------------------------------------------------------------------------
Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------

n/a

============================================================================
REVIEWER #4
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

This paper describes an automatic medical image report generation system. The main contribution is the focus on abnormal findings rather than training on raw image-to-text corpora. The results are strong: there are significant gains in accuracy, precision, and recall compared to previous work. There are no important weaknesses, except that some terms could be more clearly defined, which I list below. I make no claims about the originality of the work because I am not intimately familiar with recent related work.

Reasons to accept
---------------------------------------------------------------------------

Typical image-to-text systems do not take into account the importance of concepts because they are trained using a per-word loss on a reference corpus. This is an important shortcoming, especially for critical applications like medical diagnosis. This paper presents a reasonable solution to this problem, which results in clear improvements in extrinsic evaluation.
Reasons to reject
---------------------------------------------------------------------------

There are no risks in having this paper presented.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Overall Recommendation: 4

Questions for the Author(s)
---------------------------------------------------------------------------

I think better descriptions of the following terms when they first appear would improve the readability and widen the potential audience of this paper:

1. Introduction
- The definition of "abnormal findings" is not clear from the introduction: are they sentences, a subset of sentences, regions of images, or classes of pathology?
- What is the motivation for "curating templates for abnormal findings"?
- How exactly are "report generation" and "cross-modal retrieval" related? How do we use one for the other?
- Please define, or point to a definition/reference for, "conditional visual-semantic embeddings" and "hinge-based triplet ranking loss" at first use.

2. Related work
- Is there a reference for MTI?
- In a hybrid retrieval-generation model, what are the templates retrieved? Do they need to be filled, paraphrased, or completed in any way?
- What are "abnormality graphs" and "abnormality terms", and how are they related to "abnormal findings"?

3. Approach
- When we switch from R = s1, s2, ... to R = a1, a2, ..., does the variable change have a significance? The s_i are sentences; are the a_i also sentences, or a subset of the sentences?
- How are regions defined in an image? Do we carve the image up at preprocessing, or take the intermediate output of some DNN?
- During training, how are matched/unmatched abnormal finding/image pairs generated?
- What are the pre-trained models used for the sentence embeddings?
- Why do we need mutual exclusivity rules, and how exactly are the resulting groups of abnormal findings used?

4. Baselines
- Isn't including only instances that have at least one abnormal finding problematic? The resulting model would almost always try to find something wrong with the input, no?
- A reference for DenseNet-121 is missing.
- "We take top 3 retrieval results": what are retrieval results? This goes back to how the groups of abnormal findings are used during training and inference. The relation between these groups, the images, and the sentences during training and inference is never clearly explained.

---------------------------------------------------------------------------