============================================================================
ACL 2023 Reviews for Submission #4124
============================================================================

Title: Synthetic Pre-Training Tasks for Neural Machine Translation

Authors: Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley and Rogerio Feris

============================================================================
META-REVIEW
============================================================================

Comments: The paper proposes to pre-train neural machine translation models with synthetic data. The idea is interesting and the results are positive. There are writing issues, but these are likely to be fixed in the next version.

============================================================================
REVIEWER #1
============================================================================

What is this paper about and what contributions does it make?
---------------------------------------------------------------------------
The authors propose a new approach to pre-training neural models for machine translation using synthetic data, which they then fine-tune with real data. They investigate three methods for generating synthetic data and evaluate their proposals on a variety of translation tasks. The three techniques are: (1) obfuscating parallel data, (2) concatenating phrases generated by a statistical machine translation system, and (3) generating synthetic data without using any real data, according to three different criteria.

While the idea of pre-training neural models from synthetic data is interesting, there are several limitations to this work. On the one hand, the authors' presentation of their proposals is somewhat confusing. On the other hand, the results are not entirely convincing.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
1. The idea of pre-training neural models from synthetic data is interesting.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
1. The presentation of the results is somewhat confusing.
2. The results are not entirely convincing.
---------------------------------------------------------------------------

Questions for the Author(s)
---------------------------------------------------------------------------
- Figure 1: Are the descriptions of the axes in the figure correct?
- Section 3.1: If I have understood correctly, the real corpus should be word-aligned.
- Section 3.1: The selection of words to obfuscate in the first approach is done randomly, rather than focusing on more important words that carry information relevant for translation (see the sketch below).
- Section 4: A summary of the data used in the experiments should be included in the main text rather than in an appendix.
- Section 4.2: I did not find information about the real data used to build the phrases.
- Figure 3: Some of the translation results seem strange and require further explanation.
- Section 5.3: Has it been verified that the words corresponding to toxic words do not appear in the source?
- Figure 3 and Table 1: The results are not compared across the different approaches, which makes it difficult to assess the relative effectiveness of each approach.
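
A minimal sketch, in Python, of the random word obfuscation discussed in the Section 3.1 questions above: a fraction of the tokens on each side of a sentence pair is replaced, independently and without alignment, by made-up strings. The nonsense-token generator, the choice to preserve token length, and the function names are illustrative assumptions, not the authors' implementation.

    import random
    import string

    def obfuscate_pair(src_tokens, tgt_tokens, ratio=0.25, seed=None):
        """Replace a fraction `ratio` of the tokens with made-up nonsense strings.

        Source and target are obfuscated independently (no word alignment is
        used), matching the random selection questioned for Section 3.1.
        """
        rng = random.Random(seed)

        def nonsense(length):
            # A made-up lowercase string standing in for a real word.
            return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

        def obfuscate(tokens):
            k = int(round(len(tokens) * ratio))
            chosen = set(rng.sample(range(len(tokens)), k))
            return [nonsense(len(t)) if i in chosen else t
                    for i, t in enumerate(tokens)]

        return obfuscate(src_tokens), obfuscate(tgt_tokens)

    # Toy example: obfuscate half of the words on each side.
    src = "the cat sat on the mat".split()
    tgt = "le chat dort sur le tapis".split()
    print(obfuscate_pair(src, tgt, ratio=0.5, seed=0))
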
---------------------------------------------------------------------------

Typos, Grammar, Style, and Presentation Improvements
---------------------------------------------------------------------------
- Section 4.1, second paragraph: Figure 4? Perhaps this should be Figure 3.
- Table 3: A vertical bar in the last row is missing.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Soundness: 3
Excitement (Long paper): 3
Reviewer Confidence: 3
Recommendation for Best Paper Award: No
Reproducibility: 3
Ethical Concerns: No

============================================================================
REVIEWER #2
============================================================================

What is this paper about and what contributions does it make?
---------------------------------------------------------------------------
The paper explores three synthetic pre-training tasks for neural machine translation. Obfuscation replaces real words in parallel training data with synthetically made-up nonsense strings at specifiable levels of probability (.25, .50, .75, 1.0), independently (i.e. not necessarily aligned) in source and target. Concatenated phrases computes phrases using Moses-style parallel (i.e. aligned) phrase extraction and randomly recombines them to generate new parallel strings, where, as far as I can see, the linear order of the source phrases is the same as that of their corresponding target phrases. Obfuscation and concatenated phrases rely on real parallel data as starting points. By contrast, the third approach is completely synthetic and comes in three variants. The first, identity, uses identical sequences of synthetic lower-case 3-character words in source and target; the second, building on the first, upper-cases the target side of the pairs and independently deletes source and target words with probability 0.15. The third, based on the second but, as far as I can tell, without deletion, arbitrarily brackets the source into a binary tree and randomly switches corresponding subtrees with a certain probability. Synthetic pre-training is shown to mitigate toxicity.
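
A minimal sketch of the three fully synthetic variants described above (identity, case mapping with deletion, permuted binary trees), assuming a lower-case 3-character vocabulary. The probabilities, the decision to apply the case mapping and subtree permutation to the target side, and the function names are illustrative assumptions rather than the authors' implementation.

    import random
    import string

    rng = random.Random(1)

    def synthetic_word():
        # A synthetic lower-case 3-character "word".
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(3))

    def identity_pair(length=8):
        # Variant 1 (identity): identical synthetic sequences on both sides.
        words = [synthetic_word() for _ in range(length)]
        return list(words), list(words)

    def cased_deletion_pair(length=8, p_del=0.15):
        # Variant 2: upper-case the target side, then independently delete
        # source and target words with probability p_del.
        src, tgt = identity_pair(length)
        tgt = [w.upper() for w in tgt]
        src = [w for w in src if rng.random() > p_del]
        tgt = [w for w in tgt if rng.random() > p_del]
        return src, tgt

    def permuted_tree_pair(length=8, p_swap=0.5):
        # Variant 3: bracket the sequence into a random binary tree and swap
        # corresponding subtrees with probability p_swap (no deletion); here
        # the permutation is applied to the target side for illustration.
        src, tgt = identity_pair(length)
        tgt = [w.upper() for w in tgt]

        def permute(words):
            if len(words) <= 1:
                return list(words)
            split = rng.randint(1, len(words) - 1)  # random bracketing point
            left, right = permute(words[:split]), permute(words[split:])
            return right + left if rng.random() < p_swap else left + right

        return src, permute(tgt)

    for make_pair in (identity_pair, cased_deletion_pair, permuted_tree_pair):
        s, t = make_pair()
        print(make_pair.__name__, " ".join(s), "->", " ".join(t))
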
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
Probably the first paper that explores this type of synthetic pre-training for NMT. The paper is generally well written and clear.
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
In Section 4 (Experimental Framework), data sets of different sizes are used in Sections 4.1 (3.5M), 4.2 (2M), and 4.3 (2M). This makes it very difficult to compare the approaches. I may have missed this, but what is the motivation for this choice? To be able to better assess the approaches, it would be important to show effectiveness at different data sizes (e.g. learning curves) and to compare with familiar back-translation-based approaches to synthetic data generation.
---------------------------------------------------------------------------

Questions for the Author(s)
---------------------------------------------------------------------------
Question A: In Section 4 (Experimental Framework), data sets of different sizes are used in Sections 4.1 (3.5M), 4.2 (2M), and 4.3 (2M). This makes it very difficult to compare the approaches. I may have missed this, but what is the motivation for this choice?
Question B:
Question C:
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Soundness: 3
Excitement (Long paper): 3.5
Reviewer Confidence: 4
Recommendation for Best Paper Award: No
Reproducibility: 4
Ethical Concerns: No

============================================================================
REVIEWER #3
============================================================================

What is this paper about and what contributions does it make?
---------------------------------------------------------------------------
The paper describes three approaches to producing synthetic data: (i) obfuscated parallel data, (ii) synthetic data created using aligned phrases, and (iii) completely synthetic data using synthetic tokens (identity operation, case mapping, permuted trees), and explains how the synthetic data is used for pre-training.
---------------------------------------------------------------------------

Reasons to accept
---------------------------------------------------------------------------
Models pre-trained on synthetic data obtain comparable performance for most languages, which is interesting. This pre-training method can significantly reduce catastrophic mistranslations and toxicity. Methods using permuted binary trees seem to be the most effective in this respect. However, the authors did not show the performance of the proposed methods in relation to toxicity in translation (i.e. no qualitative evaluation).
---------------------------------------------------------------------------

Reasons to reject
---------------------------------------------------------------------------
This is a good paper, as it attempts to mitigate a crucial problem in translation. However, it is not good enough to be published at ACL. The writing is casual in some places, e.g. some figures appear nowhere near their descriptions. Moreover, text that should be part of the main paper is instead placed in the appendices. These issues hamper the flow of reading.
---------------------------------------------------------------------------

Questions for the Author(s)
---------------------------------------------------------------------------
Question A: As for the obfuscated parallel data, what is the rationale for choosing the values of R?
Question B: There may be many SMT phrases (e.g. nested phrases or the entire sentence) for a single sentence pair. How did you decide which phrases to consider when generating a synthetic sentence pair?
---------------------------------------------------------------------------

Missing References
---------------------------------------------------------------------------
Check Aji et al. 2020a and 2020b: they are the same paper.
---------------------------------------------------------------------------

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Soundness: 3
Excitement (Long paper): 2.5
Reviewer Confidence: 4
Recommendation for Best Paper Award: No
Reproducibility: 3
Ethical Concerns: No