============================================================================
EMNLP 2021 Reviews for Submission #3560
============================================================================

Title: Weakly Supervised Contrastive Learning for Chest X-ray Report Generation

Authors: An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili, Julian McAuley and Chun-Nan Hsu

============================================================================
META-REVIEW
============================================================================

Comments: This paper studies report generation for chest x-rays. The key novelty is the addition of a contrastive loss function between the x-rays and the reports. The reviewers note that the application of contrastive image-text matching is novel for the chest x-ray domain but is otherwise an already well-known technique. The reviewers praise the paper's writing, strong experimental results, and the thoroughness of the experiments given the space constraints of a short paper. Overall, the reviewers lean positive, but they criticize the novelty and note that the paper is relatively niche, so its audience at EMNLP may be limited.

============================================================================
REVIEWER #1
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

One of the major problems in generating chest X-ray reports is that cross-entropy loss often leads models to default to generating normal findings. This paper proposes a novel contrastive-learning-based method to address this issue.

Strengths:
+ The paper is clearly written.
+ The idea of using contrastive learning is novel.
+ The proposed method achieves state-of-the-art performance.

Weaknesses:
- Some related work is missing; please see the Missing References section.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

The idea of using contrastive learning to address the tendency of models to generate normal findings is novel. The paper is clearly written, and the experimental results show the effectiveness of the model.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

N/A

---------------------------------------------------------------------------
Questions for the Author(s)
---------------------------------------------------------------------------

N/A

---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------

Jing, Baoyu, Zeya Wang, and Eric Xing. "Show, Describe and Conclude: On Exploiting the Structure Information of Chest X-ray Reports." arXiv preprint arXiv:2004.12274 (2020). This paper uses reinforcement learning to encourage the model to generate more diverse and accurate reports.
---------------------------------------------------------------------------
Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------

N/A

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Ethical Concerns: No
Overall Recommendation - Short Paper: 4

============================================================================
REVIEWER #2
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

This paper introduces contrastive learning to chest x-ray report generation. The core idea is to find similar x-ray images via clustering and then encourage the model to distinguish between the reports generated for these similar images. A loss function is proposed that maximizes the similarity between a source image and its target sequence while minimizing the similarity of negative pairs. Experimental results demonstrate the effectiveness of the proposed approach.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

The idea of distinguishing between similar samples to enhance x-ray report generation will be interesting to the audience. A novel weakly supervised contrastive loss is proposed and demonstrated to be effective.

---------------------------------------------------------------------------
Questions for the Author(s)
---------------------------------------------------------------------------

It would be good to include some samples of generated reports that provide more detail, to help readers understand the effect of contrastive learning.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 4
Ethical Concerns: No
Overall Recommendation - Short Paper: 3.5

============================================================================
REVIEWER #3
============================================================================

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?
---------------------------------------------------------------------------

Task: The paper presents a model for chest x-ray image-to-text report generation.

Problem: Datasets for the chest x-ray report generation task contain mostly normal findings, and previous approaches struggle to generate long, diverse, and clinically accurate reports.

Contribution: The paper proposes an additional weakly supervised contrastive loss to augment the training of the previous state-of-the-art model, which uses a memory-augmented transformer architecture.

Approach: The weakly supervised contrastive loss gives higher weight to "hard" negative examples (negative examples that are semantically close to the ground truth). The authors use a domain-specific transformer to generate embeddings of the text reports, which are then clustered using K-Means. Negative examples belonging to the same cluster (semantically close) are treated as "hard" negatives, and the weighting between "hard" (same semantic cluster) and "easy" (different semantic cluster) negatives is controlled by a hyper-parameter.
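For concreteness, here is a minimal sketch of what such a cluster-weighted contrastive objective could look like. This is illustrative only, not the authors' implementation: it assumes precomputed, index-aligned image and report embeddings, in-batch negatives, and a per-report K-Means cluster id; all names and the exact formulation are assumptions.

    # Illustrative sketch only -- not the authors' code. Assumes img_emb[i]
    # is paired with rep_emb[i], and cluster_ids[i] is the K-Means cluster
    # id of report i. alpha > 0 is assumed.
    import torch
    import torch.nn.functional as F

    def weighted_contrastive_loss(img_emb, rep_emb, cluster_ids,
                                  alpha=2.0, tau=0.1):
        """InfoNCE-style loss over a batch: pull each (image, report)
        pair together, push apart the other reports, and up-weight
        "hard" negatives from the same semantic cluster by alpha."""
        img_emb = F.normalize(img_emb, dim=-1)
        rep_emb = F.normalize(rep_emb, dim=-1)
        sim = img_emb @ rep_emb.t() / tau              # (B, B) similarities
        same = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
        weights = torch.where(same, torch.full_like(sim, alpha),
                              torch.ones_like(sim))    # hard negatives: alpha
        weights.fill_diagonal_(1.0)                    # positives keep weight 1
        logits = sim + weights.log()                   # weighted softmax denominator
        labels = torch.arange(sim.size(0), device=sim.device)
        return F.cross_entropy(logits, labels)

Adding the log-weights to the logits is one simple way to realize a weighted softmax denominator; the paper's exact formulation may differ.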
Strengths:
- Largely well-written paper for a short paper.
- New state-of-the-art results.
- Evaluated on both clinical accuracy and the quality of the generated text reports.

Weaknesses:
- The ideas presented are quite well known. Contrastive learning (CL), as well as including "hard" negatives in training under a CL framework, is very well understood in the CV community and is also not novel in the NLP community.
- Not a new model: an additional loss term is added to the previous SoTA architecture.
- Only minor improvements over the previous SoTA (29.4 to 30.0 in the F1 metric for clinical accuracy; 23.0 to 24.1 in the ROUGE-L metric for NLG).
- The supplementary material lacks key details.

---------------------------------------------------------------------------
Reasons to accept
---------------------------------------------------------------------------

The paper presents state-of-the-art results on the task it tackles. It presents a neat use of contrastive learning and achieves consistent, albeit minor, improvements across the board. Even as a short paper, it contains ablation studies and investigations into different contrastive loss formulations.

---------------------------------------------------------------------------
Reasons to reject
---------------------------------------------------------------------------

The problem area is niche, and the contributions of the paper, while a unique application to the task at hand, are neither unique nor unknown in the NLP and CV communities. As such, there is no significant novelty involved. This paper might be better served by a targeted workshop than by the main conference.

---------------------------------------------------------------------------
Questions for the Author(s)
---------------------------------------------------------------------------

Section 3: In training the weakly supervised contrastive loss, the paper mentions that alpha (used to weight the "hard" vs. "easy" negative examples) is "adaptive". Based on Table 2, it seems that alpha is just a hyper-parameter and your approach uses a fixed alpha of 2.
- How is this alpha "adaptive"?
- Did you utilize any form of curriculum learning (increasing the difficulty of samples incrementally during training)?
- Was any ablation study done on the effect of using alpha > 2? (Table 2 presents results for alpha values 0, 1, and 2.)

How many negative examples are used in training: 10, 20, or the entire training set? Is there any study on the effect of using fewer vs. more negative examples?

---------------------------------------------------------------------------
Missing References
---------------------------------------------------------------------------

Kindly include references on curriculum learning, i.e., increasing the proportion of easy to hard negative examples when using contrastive learning (a minimal sketch of such a schedule follows the list below). Some papers I can point to (not exhaustive):

Curriculum Learning for Natural Language Understanding, Xu et al., 2020
Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks, Cirik et al., 2016
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization, Korbar et al., 2018
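As an illustration of the kind of easy-to-hard curriculum suggested above (hypothetical; nothing of this sort appears in the paper), the hard-negative weight could simply be annealed over training:

    # Hypothetical sketch: start training with little or no extra weight on
    # same-cluster "hard" negatives and anneal the weight up as training
    # progresses. Function and variable names are illustrative.
    def alpha_schedule(step, total_steps, alpha_min=1.0, alpha_max=2.0):
        """Linearly anneal the hard-negative weight alpha."""
        frac = min(step / max(total_steps, 1), 1.0)
        return alpha_min + frac * (alpha_max - alpha_min)

    # Usage inside a training loop (illustrative):
    # alpha = alpha_schedule(step, total_steps)
    # loss = weighted_contrastive_loss(img_emb, rep_emb, cluster_ids, alpha=alpha)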
---------------------------------------------------------------------------
Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------

Section 3: The notation is unclear at times. For example, in Eq. 1, y is introduced without explanation.

Section 4: Some implementation details are unclear. For example, how was K determined for the K-Means clustering?

Supplementary Material: The introduction mentions that other approaches lead to high-frequency tokens or sentences appearing too often. Kindly include some examples of texts generated by previous approaches as well, so the quality differences can be seen.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Reproducibility: 5
Ethical Concerns: No

Ethics Justification
---------------------------------------------------------------------------

None. The paper uses existing datasets.

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

Overall Recommendation - Short Paper: 3.5
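On the open question above about how K was chosen: one standard heuristic (purely illustrative; the paper does not state its procedure) is to sweep candidate values and keep the K with the best silhouette score:

    # Illustrative only -- a common way K could be chosen for K-Means,
    # not the authors' stated procedure. Assumes report_embeddings is an
    # (N, d) array of report embeddings.
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def pick_k(report_embeddings, k_candidates=range(2, 21)):
        """Return the K with the best silhouette score on the embeddings."""
        best_k, best_score = None, -1.0
        for k in k_candidates:
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=0).fit_predict(report_embeddings)
            score = silhouette_score(report_embeddings, labels)
            if score > best_score:
                best_k, best_score = k, score
        return best_k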