Reviewer #1 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)

This paper proposes a framework that removes ambiguous jargon and misleading information from medical reports and rewrites the ambiguities in lay-person terms, based on contrastive pretraining and perturbation-based rewriting. Together with the framework, two datasets are released to test and validate the work. The reported results show an overall positive improvement over the best baseline approach.

2. {Strengths and Weaknesses} Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following dimensions: novelty, quality, clarity, and significance.

Strengths:
1. The paper is very well written and easy to follow, despite the complexity of the framework proposed. First, the model illustration is presented; next, each of the main components and loss functions is described in concept and formalized.
2. The experimental setup is extensive, including a comparison with the state-of-the-art algorithm and something of an ablation study.
3. The code for the framework is readily available, so the work is reproducible, encouraging future research. Hyperparameters were also discussed in detail (although they are available in the supplementary material).

However:
1. Experiments with only one train/validation/test split leave me wondering whether the improvement is statistically significant.
2. The evaluation metrics, especially fidelity, are confusing with respect to how they are calculated.

3. {Questions for the Authors} Please carefully describe questions that you would like the authors to answer during the author feedback period. Think of the things where a response from the author may change your opinion, clarify a confusion or address a limitation. Please number your questions.

1. Can cross-validation be done in this work?
2. Can the authors expand the explanation of the evaluation metrics? Perhaps mathematical formulas would help readers understand better.
3. In the Related Work section, the first sentence of the 'Controlled Text Generation' part has a typo: '... of ccontrolling ...'

4. {Significance of the problem} How significant is the problem?
Excellent: The social impact problem considered by this paper is significant and has not been adequately addressed by the AI community.

5. {Engagement with literature} What is the level of engagement with the literature?
Good: The paper shows a strong understanding of other literature on the problem, perhaps focusing on various subtopics or on the CS literature.

6. {Novelty of approach} How novel is the approach used in the paper?
Good: The paper substantially improves upon an existing model, data gathering technique, algorithm, and/or data analysis technique.

8. {Quality of evaluation} How well is the approach evaluated?
Excellent: The evaluation was exemplary: data described the real world and was analyzed thoroughly.

9. {Facilitation of follow-up work} How well does this work facilitate follow-up work?
Excellent: open-source code; public datasets; and a very clear description of how to use these elements in practice.

10. {Scope and promise for social impact} What is the paper's scope and promise for social impact?
Good: The likelihood of social impact is high: relatively little effort would be required to put this paper's ideas into practice, at least for a pilot study.
12. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Excellent: key resources (e.g., proofs, code, data) are available and key details (e.g., proof sketches, experimental setup) are comprehensively described such that competent researchers will be able to easily reproduce the main results.

13. {Ethical considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Not Applicable: The paper does not have any ethical considerations to address.

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper.
Accept: Technically solid paper, with high impact on at least one sub-area of AI or modest-to-high impact on more than one area of AI, with good to excellent quality, reproducibility, and if applicable, resources, and no unaddressed ethical considerations. Top 60% of accepted papers.

15. (CONFIDENCE) How confident are you in your evaluation?
Quite confident. I tried to check the important points carefully. It is unlikely, though conceivable, that I missed some aspects that could otherwise have impacted my evaluation.

16. (EXPERTISE) How well does this paper align with your expertise?
Mostly Knowledgeable: This paper has little overlap with my current work. My past work was focused on related topics and I am knowledgeable or somewhat knowledgeable about most of the topics covered by the paper.

18. I acknowledge that I have read the author's rebuttal (if applicable) and made changes to my review as needed.
Agreement accepted

Reviewer #2 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)

This paper focuses on the ambiguity issue in automatically generated reports. Specifically, the authors summarize the ambiguities into three groups and propose a novel contrastive-learning-based rewriting algorithm to ease these issues. Furthermore, the authors release two medical-report-domain datasets, and the proposed algorithm achieves competitive performance.

2. {Strengths and Weaknesses} Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following dimensions: novelty, quality, clarity, and significance.

Strengths:
1. The contrastive algorithm is simple yet effective, and easy to follow.
2. The re-annotated datasets open a new way of solving this task.

Weaknesses:
1. Some abbreviations should be defined when they first appear, e.g., VA-, OpenI-.
2. The notation and highlighting shown in the figure need to be explained clearly.
3. Some typos, e.g., 'constrative' -> 'contrastive'; a missing '-' in Table 4.

3. {Questions for the Authors} Please carefully describe questions that you would like the authors to answer during the author feedback period. Think of the things where a response from the author may change your opinion, clarify a confusion or address a limitation. Please number your questions.

1. Can you explain the human evaluation in more detail, especially regarding the volunteers?
2. Why does KBR (which uses a knowledge base) show worse results on OpenI (a narrow domain) than on VA (an open domain)?
4. {Significance of the problem} How significant is the problem?
Good: This paper represents a new take on a significant social impact problem that has been considered in the AI community before.

5. {Engagement with literature} What is the level of engagement with the literature?
Good: The paper shows a strong understanding of other literature on the problem, perhaps focusing on various subtopics or on the CS literature.

6. {Novelty of approach} How novel is the approach used in the paper?
Fair: The paper makes a moderate improvement to an existing model, data gathering technique, algorithm, and/or data analysis technique.

8. {Quality of evaluation} How well is the approach evaluated?
Fair: The evaluation was adequate but had significant flaws: datasets were unrealistic and/or analysis was insufficient.

9. {Facilitation of follow-up work} How well does this work facilitate follow-up work?
Good: some elements are shared publicly (data, code, or a running system) and little effort would be required to replicate the results or apply them to a new domain.

10. {Scope and promise for social impact} What is the paper's scope and promise for social impact?
Fair: The likelihood of social impact is moderate: this paper gets us closer to its goal, but considerably more work would be required before the paper's ideas could be implemented in practice.

12. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Good: key resources (e.g., proofs, code, data) are available and sufficient details (e.g., proofs, experimental setup) are described such that an expert should be able to reproduce the main results.

13. {Ethical considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Not Applicable: The paper does not have any ethical considerations to address.

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper.
Weak Accept: Technically solid, modest-to-high impact paper, with no major concerns with respect to quality, reproducibility, and if applicable, resources, ethical considerations.

15. (CONFIDENCE) How confident are you in your evaluation?
Somewhat confident, but there's a chance I missed some aspects. I did not carefully check some of the details, e.g., novelty, proof of a theorem, experimental design, or statistical validity of conclusions.

16. (EXPERTISE) How well does this paper align with your expertise?
Mostly Knowledgeable: This paper has little overlap with my current work. My past work was focused on related topics and I am knowledgeable or somewhat knowledgeable about most of the topics covered by the paper.

Reviewer #6 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)

This paper tackles the problem of simplifying medical reporting language so that it is accessible to non-experts and of reducing perceived ambiguities in what a given medical result description might mean. It presents two new sets of annotations for this task, building on existing document collections.
It proposes a contrastive learning framework for the task and presents strong baseline results, including a human evaluation for the dual challenges of improving clarity while maintaining semantic content.

2. {Strengths and Weaknesses} Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following dimensions: novelty, quality, clarity, and significance.

This is an important challenge, as patient access to healthcare records is increasing and patients are being brought more actively into the decision-making process in their care. The paper articulates a fairly clear and novel problem within this space, at a level that is appropriately scoped for focused NLP research.

The two stated goals of lay language and low ambiguity are often in conflict: specialised language exists specifically to reduce ambiguity in a complex domain. I would like to see the paper engage with this tension and with what it might mean for the systems proposed.

The task name of "medical report disambiguation" is itself ambiguous. What specific portions of the report are being disambiguated? How does this differ from other biomedical word/term sense disambiguation tasks? "Clarification" might be a more accurate description, or aligning with the text simplification literature.

There are two important aspects of the human evaluation results in Table 5 that need to be investigated in more detail. First, there is significant variability in these results, both between systems within a dataset and across datasets within systems. Nor do the systems perform consistently with respect to one another. This suggests that the model learning is fragile and may be latching onto surface phenomena rather than the semantic regularities that are desired. Second, even a fidelity of 92% on OpenI means that if the system were scaled to a million patients, eighty thousand of them would see wrong information in the "simplification". Both of these issues pose real challenges for moving this line of work forward towards something that could actually be used in practice. This paper needs to recognise these challenges and engage with why they matter and what tackling them might look like.

The data description needs more specificity. Medical reports may be long and complex documents, or may be a few brief sentences. The task appears to be formulated at the sentence level, but this decomposition of reports into sentences is not explicitly stated. In addition, the paper does not describe how reports are segmented into sentences, a vital preprocessing step that has an enormous impact on downstream processing. Table 2 also does not make clear that these are sentence counts, not report (i.e., document) counts.

The dataset annotation process is described in reasonable detail, though it could be more explicit regarding the number of rounds required to reach consensus, the trajectory of inter-annotator agreement over those rounds, and the final split of individually annotated samples. The composition of the medical team is also important information that is missing.

The reference list needs to be updated with publication venues. Nearly half the references listed are given only as arXiv preprints. If these references have been peer reviewed and published, that venue information is important for the reader to understand (1) that they are reliable references and (2) what kind of audience they are speaking to.
If this many references have not been peer reviewed, then this paper may be inappropriately situated in the accepted literature.

3. {Questions for the Authors} Please carefully describe questions that you would like the authors to answer during the author feedback period. Think of the things where a response from the author may change your opinion, clarify a confusion or address a limitation. Please number your questions.

What kinds of risks does a process like this pose? How can these clarifications mislead patients as well as inform them, and how might those misunderstandings be addressed?

4. {Significance of the problem} How significant is the problem?
Good: This paper represents a new take on a significant social impact problem that has been considered in the AI community before.

5. {Engagement with literature} What is the level of engagement with the literature?
Good: The paper shows a strong understanding of other literature on the problem, perhaps focusing on various subtopics or on the CS literature.

6. {Novelty of approach} How novel is the approach used in the paper?
Good: The paper substantially improves upon an existing model, data gathering technique, algorithm, and/or data analysis technique.

8. {Quality of evaluation} How well is the approach evaluated?
Good: The evaluation was convincing: datasets were realistic; analysis was solid.

9. {Facilitation of follow-up work} How well does this work facilitate follow-up work?
Fair: moderate effort would be required to replicate the results or apply them to a new domain.

10. {Scope and promise for social impact} What is the paper's scope and promise for social impact?
Fair: The likelihood of social impact is moderate: this paper gets us closer to its goal, but considerably more work would be required before the paper's ideas could be implemented in practice.

12. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Fair: key resources (e.g., proofs, code, data) are unavailable and/or some key details (e.g., proof sketches, experimental setup) are unavailable, which makes it difficult to reproduce the main results.

13. {Ethical considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Poor: The paper fails to address obvious ethical considerations.

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper.
Borderline accept: Technically solid paper where reasons to accept, e.g., good novelty, outweigh reasons to reject, e.g., fair quality. Please use sparingly.

15. (CONFIDENCE) How confident are you in your evaluation?
Quite confident. I tried to check the important points carefully. It is unlikely, though conceivable, that I missed some aspects that could otherwise have impacted my evaluation.

16. (EXPERTISE) How well does this paper align with your expertise?
Expert: This paper is within my current core research focus and I am deeply knowledgeable about all of the topics covered by the paper.