Meta Review of Paper1047 by Area Chair GREC
ACL ARR 2023 December Paper1047 Area Chair GREC, 06 Feb 2024 (modified: 06 Feb 2024)
Readers: Paper1047 Senior Area Chairs, Paper1047 Area Chairs, Paper1047 Authors, Paper1047 Reviewers Submitted, Program Chairs
Metareview: The paper focuses on LLM alignment via RLAIF. It compares DPO and SLiC to supervised fine-tuning. Instruction pairs are extracted from existing LLMs of varying strength (InstructGPT, ChatGPT and GPT-4). The paper is very much focused on comparing existing methods such as SLiC and DPO rather than proposing any significant novel methods of its own.
Summary Of Reasons To Publish: Automatic post-training methods are of value and the investigation of the particular methods presented here would be a valuable reference for the community. The quality and scale of the analysis are high; the effectiveness of the method is tested with extensive experiments and detailed analysis.
Summary Of Suggested Revisions: The earlier part of the paper is well written, but the later sections, and the conclusions in particular, would benefit from rewriting. Most reviewers raise the issue of novelty, since the paper only recombines existing strategies (SLiC, DPO) and performs RLAIF by extracting instruction pairs from InstructGPT/ChatGPT/GPT-4. The results are still useful, timely and relevant. Several other issues are noted by the reviewers: the description of Pair Construction could be more detailed, and the definitions of hard/easy pairs could be made clearer. The structure of the paper regarding prior work should be improved. Curriculums should be discussed, etc. Several results (e.g. Table 3) need more discussion. In the next iteration (or final version) the authors should thoroughly review the reviewers' comments and their own rebuttal and integrate them into the paper.
The Limitations section is compulsory for *ACL conferences and should be added at the end of the paper.
Overall Assessment: 3 = There are major points that may be revised
Best Paper Ae: No
Ethical Concerns: None
Needs Ethics Review: No
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Official Review of Paper1047 by Reviewer vRwJ
ACL ARR 2023 December Paper1047 Reviewer vRwJ, 21 Jan 2024
Paper Summary: This paper presents an empirical investigation of contrastive pair tuning of LLMs with a focus on methods which can leverage pairs obtained from different classes of existing LLMs that have a commonly understood 'ranking' of quality. The paper is very much focused on comparing existing methods such as SLiC and DPO rather than proposing any significant novel methods of its own.
Summary Of Strengths: Automatic post-training methods are of value and the investigation of the particular methods presented here would be a valuable reference for the community. The quality and scale of the analysis are high. The overall description of the goals of the work is straightforward and approachable. Replicability is high, but that is due to the relative simplicity of the work presented.
Summary Of Weaknesses: My biggest challenge with this paper is the relative lack of novelty in the work presented. I have no doubt that considerable effort has gone into the preparation of this paper, but in the end the contributions made and the refinement of the paper do leave some room for improvement.
Comments, Suggestions And Typos: The earlier part of the paper is well written, but the later sections, and the conclusions in particular, would benefit from rewriting.
Soundness: 4 = Strong: This study provides sufficient support for all of its claims/arguments. Some extra experiments could be nice, but not essential.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: There are no direct ethical considerations.
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 2 = Documentary: The new datasets will be useful to study or replicate the reported research, although for other purposes they may have limited interest or limited usability. (Still a positive rating)
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Response to Reviewer vRwJ
ACL ARR 2023 December Paper1047 Authors (Canwen Xu), 25 Jan 2024
Comment: Thanks for your insightful comment. We acknowledge the concern that our paper does not provide a new method for aligning LLMs. However, we would like to first highlight the nature of an empirical study. Also, our paper provides insights for constructing pairwise data, which plays an important role in the data-centric era of LLM training and alignment.
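The pairwise data discussed throughout this thread comes from treating a stronger model's output as the preferred response and a weaker model's output as the dispreferred one. A minimal sketch of that rule follows, assuming the quality ordering InstructGPT < ChatGPT < GPT-4 implied by the meta review. The names `MODEL_RANK`, `PreferencePair` and `build_pair` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Assumed ranking, weakest to strongest; an illustration, not the paper's code.
MODEL_RANK = {"InstructGPT": 0, "ChatGPT": 1, "GPT-4": 2}


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # output of the stronger model (positive sample)
    rejected: str  # output of the weaker model (negative sample)
    rank_gap: int  # larger gap = more distinguishable pair


def build_pair(prompt: str, outputs: dict, strong: str, weak: str) -> PreferencePair:
    """Label the stronger model's output as chosen, with no per-example judging."""
    gap = MODEL_RANK[strong] - MODEL_RANK[weak]
    if gap <= 0:
        raise ValueError(f"{strong} must rank above {weak}")
    return PreferencePair(prompt, outputs[strong], outputs[weak], gap)


# Hypothetical usage with placeholder outputs.
outputs = {"InstructGPT": "answer A", "ChatGPT": "answer B", "GPT-4": "answer C"}
pair = build_pair("Explain DPO briefly.", outputs, strong="GPT-4", weak="InstructGPT")
```

Because the label depends only on which model produced the text, any example where the weaker model happens to give the better answer is mislabeled, which is the noise concern raised later in the thread.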
Official Review of Paper1047 by Reviewer zqgn
ACL ARR 2023 December Paper1047 Reviewer zqgn, 21 Jan 2024
Paper Summary: The author proposes a method for Automatic Pair Construction for Contrastive Post-training that can quickly construct pair samples. The authors compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continuing SFT saturates.
Summary Of Strengths: The author proposes a method for Automatic Pair Construction for Contrastive Post-training that can quickly construct pair samples. The author has validated the effectiveness of the method with extensive experiments and has provided a detailed analysis.
Summary Of Weaknesses: The innovation is insufficient; although the Automatic Pair Construction for Contrastive Post-training method proposed by the author is effective, its novelty does not suffice to support an extensive paper. The description of Pair Construction is not detailed enough; the definitions of hard pair and easy pair are not clear, and there is no analysis of the impact of the quantity of hard pairs and easy pairs on the model. Using Contrastive Post-training will change the training samples for DPO, therefore, a direct comparison with DPO is not entirely fair. In the latest algorithm KTO, Pair samples are no longer needed, which will diminish the value of this method.
Comments, Suggestions And Typos: None
Soundness: 2 = Poor: Some of the main claims/arguments are not sufficiently supported. There are major technical/methodological problems.
Overall Assessment: 2 = Revisions Needed: This paper has some merit, but also significant flaws, and needs work before it would be of interest to the community.
Confidence: 4 = Quite sure.
I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Limitations And Societal Impact: They have no limitations
Ethical Concerns: no
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Author-Editors Confidential Comment by Paper1047 Authors (Confidential Comment to Editors)

Response to Reviewer zqgn
ACL ARR 2023 December Paper1047 Authors (Canwen Xu), 25 Jan 2024
Comment: Thanks for your insightful comments.
"The innovation is insufficient; although the Automatic Pair Construction for Contrastive Post-training method proposed by the author is effective, its novelty does not suffice to support an extensive paper."
We would like to first highlight the nature of an empirical study. Also, our paper provides insights for constructing pairwise data, which plays an important role in the data-centric era of LLM training and alignment. These experiments are expensive and our conclusions are helpful for future work.
"The description of Pair Construction is not detailed enough; the definitions of hard pair and easy pair are not clear, and there is no analysis of the impact of the quantity of hard pairs and easy pairs on the model."
Our process for pair construction is clearly described in Section 4.
To summarize in a sentence, we use a "better" model's output as the positive and a "worse" model's output as the negative. Similarly, a hard pair consists of outputs from less distinguishable models (GPT-4 and GPT-3.5-turbo) and an easy pair consists of outputs from more distinguishable models (GPT-4 and InstructGPT). Quantity is not a focus of our paper. It is widely recognized that more training data often leads to better performance.
"Using Contrastive Post-training will change the training samples for DPO, therefore, a direct comparison with DPO is not entirely fair."
We do not propose an alternative to DPO. Instead, we only investigate how to construct pairs for DPO.
"In the latest algorithm KTO, Pair samples are no longer needed, which will diminish the value of this method."
KTO requires human feedback on whether an output is good without constructing a pair. Our work intends to automatically construct pairs for contrastive post-training. These two works are orthogonal.

Official Review of Paper1047 by Reviewer XxXE
ACL ARR 2023 December Paper1047 Reviewer XxXE, 20 Jan 2024
Paper Summary: This paper proposes a new method for LLM post-training, combining RLAIF with contrastive feedback. The paper proposes using two teacher models of varying competence to provide positive and negative samples to be used in the DPO algorithm, thus eliminating the need for human-annotated contrastive pairs. The paper compares three post-training methods: RLAIF, SLiC, and DPO. First, the paper finds that training using the proposed approach is able to achieve the best performance.
Second, the paper proposes using a curriculum learning approach, defining easy contrastive pairs as ones produced by teacher models of widely varying performance (InstructGPT vs GPT-4), and hard pairs as ones produced by closely performing teachers (ChatGPT vs InstructGPT). The authors find that training on the easy pairs first and then the hard pairs may offer improved performance.
Summary Of Strengths: The paper addresses the problem of expensive human annotations needed for post-training and proposes an effective alternative. The paper introduces a useful definition and baseline for curriculum learning in post-training LLMs.
Summary Of Weaknesses: The paper describes the need for post-training to align LLM generations with human values and expectations but fails to discuss how the alternative solution of post-training using AI-generated positive and negative samples achieves that goal. For example, does the paper assume that the teacher model has been aligned through human feedback? The paper is missing a direct comparison with prior work, in terms of performance difference against their full training approaches (e.g. using human-annotated preferences), vs. the reduced cost incurred by the present work. Sections 1, 2, 3, and 4 repeatedly re-introduce the prior work and its limitations, with minor added details. The paper reads as repetitive and the presentation may be improved. The text is missing any mention or discussion of curriculums (5) and (6) from Table 7. These results may challenge the paper's claims as anti-curriculum outperforms curriculum, and neither performs better than the standard training baseline.
Comments, Suggestions And Typos: The paper makes a point about ChatGPT targets being easier to imitate, and GPT-4 outputs being more challenging (line 338). This point requires more justification. Although GPT-4 outputs are more challenging for a model to come up with, they may not be harder to imitate.
It is not clear to me why curricula (1) and (2) from Table 7 perform worse than the baselines in rows 1 and 2 of Table 3.
Soundness: 2.5
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Response to Reviewer XxXE
ACL ARR 2023 December Paper1047 Authors (Canwen Xu), 25 Jan 2024
Comment: We would like to thank the reviewer for your insightful comments.
"The paper describes the need for post-training to align LLM generations with human values and expectations but fails to discuss how the alternative solution of post-training using AI-generated positive and negative samples achieves that goal. For example, does the paper assume that the teacher model has been aligned through human feedback?"
This is a good question. As recent works (Vicuna, Orca, Phi etc.) have shown, even without human feedback, a model can be aligned with only SFT. Thus, we do not assume the positive and negative teacher models are aligned through RLHF or other kinds of human feedback.
We only assume that one model is better aligned than the other, which is easy to satisfy.
"The paper is missing a direct comparison with prior work, in terms of performance difference against their full training approaches (e.g. using human-annotated preferences), vs. the reduced cost incurred by the present work."
It is hard to control for the same amount of data, especially since extensive human labor would be required to label the same data as we used in our work. However, in Table 5, we provide baselines trained with RLHF using off-the-shelf reward models.
"Sections 1, 2, 3, and 4 repeatedly re-introduce the prior work and its limitations, with minor added details. The paper reads as repetitive and the presentation may be improved."
Thanks for your suggestion. We will keep improving the presentation.
"The text is missing any mention or discussion of curriculums (5) and (6) from Table 7. These results may challenge the paper's claims as anti-curriculum outperforms curriculum, and neither performs better than the standard training baseline."
Curriculum (5) is easy->hard (starting with a more distinguishable pair) and (6) is hard->easy. These results are consistent in trend with curriculums (3) and (4).

Official Review of Paper1047 by Reviewer Euk9
ACL ARR 2023 December Paper1047 Reviewer Euk9, 07 Jan 2024
Paper Summary: The paper explores contrastive post-training techniques for aligning large language models (LLMs) with human preferences, focusing on automatically constructing preference pairs from models of varying strengths. It compares contrastive techniques such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) against supervised fine-tuning (SFT) and examines a data curriculum learning scheme to improve alignment.
The study finds that DPO offers improvements in alignment efficiency and effectiveness, particularly when used with a curriculum learning approach.
Summary Of Strengths: The author attempts to use the outputs of three different Large Language Models (LLMs) as preference data, a method worth exploring due to its simplicity and potential to decrease dependence on human feedback. The author also demonstrates that implementing a curriculum-based strategy can enhance the effectiveness of human preference alignment methods.
Summary Of Weaknesses:
(1) The main contribution of the paper is not clearly defined or appears to be minor, focusing primarily on the creation of new preference data and comparing only two existing contrastive post-training methods. My recommendation is to accept this as a short paper.
(2) The data in Table 1 indicates that no model consistently outperforms others with a win rate close to 99%. However, the author adopts an approach where the output from the model with a marginally higher win rate is automatically considered the positive sample. This method can introduce significant noise, as it overlooks instances where the output from a model with a lower win rate may actually be superior. Such an approach could potentially undermine the tuning process.
(3) The conclusions drawn in the paper seem to lack solid grounding; they are not evident and show large randomness. One example is that the results in Table 3 (three rows (DPO, LLAMA, training target) are worse than (SFT, LLAMA, GPT4)) seem to contradict the conclusion in the introduction that "contrastive post-training techniques maintain a step-function advantage over continuous supervised fine-tuning".
(4) The sole reliance on GPT-4 for evaluation could lead to unreliable or biased results. Can the author think of new evaluation methods?
Comments, Suggestions And Typos: line 046: an llm; line 204: reward model r( |y)
Soundness: 2.5
Overall Assessment: 2.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Response to Reviewer Euk9
ACL ARR 2023 December Paper1047 Authors (Canwen Xu), 25 Jan 2024
Comment: We would like to thank the reviewer for your insightful comments.
"The data in Table 1 indicates that no model consistently outperforms others with a win rate close to 99%. However, the author adopts an approach where the output from the model with a marginally higher win rate is automatically considered the positive sample. This method can introduce significant noise, as it overlooks instances where the output from a model with a lower win rate may actually be superior. Such an approach could potentially undermine the tuning process."
We acknowledge the noise in the data and we believe this level of noise is acceptable. Human annotators also have some level of disagreement when labeling their preferences. A robust training approach, for example DPO, is able to tolerate the noise and still obtain improvement from contrastive training.
Meanwhile, as more and more models become available, the construction will be easier since we can find more diverse pairs with less noise.
"The conclusions drawn in the paper seem to lack solid grounding, and conclusions are not evident and have a large randomness. One example is that the results in Table 3 (three rows (DPO, LLAMA, training target) are worse than (SFT, LLAMA, GPT4)) seems to contradict with the conclusion in introduction 'contrastive post-training techniques maintain a step-function advantage over continuous supervised fine-tuning'."
DPO can improve on top of SFT while another epoch of SFT does not. We will clarify this in the revision.
"The sole reliance on GPT-4 for evaluation could lead to unreliable or biased results. Can the author think out new evaluation methods?"
We agree with the reviewer on this issue. However, many benchmarks for alignment (e.g., Alpaca Eval, MT-Bench) rely on GPT-4. A better yet more expensive way may be to introduce human labelers.

Official Review of Paper1047 by Reviewer xewV
ACL ARR 2023 December Paper1047 Reviewer xewV, 05 Jan 2024
Paper Summary: This paper proposes contrastive post-training of large language models on responses from strong and comparatively weak models, and explores a curriculum learning strategy that gradually increases the amount of less distinctive data during training. Experiments show that the approach can lead to better performance than supervised fine-tuning.
Summary Of Strengths:
1. This paper proposes to post-train on outputs of different large language models with a contrastive loss, and obtains better performance than supervised fine-tuning.
2. This paper designs a curriculum learning strategy for contrastive post-training which leads to further improvements.
Summary Of Weaknesses:
1. This paper mentions that calling the API makes it expensive to substitute human annotators with pre-aligned models, and the presented approach suffers from the same issue.
2. The improvements over supervised fine-tuning are not satisfactory: a 62.5% win ratio against supervised fine-tuning on GPT-4 is close to a tie (around 3 of 5).
3. The effectiveness with a less strong superior LLM is unclear, and it is possible that the effectiveness of the approach relies heavily on the strong GPT-4.
4. The effectiveness of the method with less distinctive models (e.g., having a win ratio of ~70%) is also unclear.
5. The data curriculum approach is quite simple and the paper lacks a comparison with other curriculum learning methods.
Comments, Suggestions And Typos:
1. I wonder about the performance of GPT-4 vs. ChatGPT as hard pairs in the data curriculum; it is harder than ChatGPT vs. InstructGPT.
2. It is suggested to also explore the effectiveness with a less strong superior model.
Soundness: 2.5
Overall Assessment: 2.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
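Whether a 62.5% win ratio is "close to a tie" depends on how many comparisons produced it, which this thread does not state. The sketch below computes an exact two-sided binomial p-value against a 50% null, assuming ties are excluded and comparisons are independent. The function name and the sample sizes used are hypothetical and chosen only to illustrate the dependence on sample size.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided p-value for H0: win probability = 0.5.

    Under p = 0.5 the binomial distribution is symmetric, so the
    two-sided p-value is twice the tail at least as extreme as observed.
    """
    k = max(wins, n - wins)  # fold onto the upper tail
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Same 62.5% win ratio at three hypothetical sample sizes.
for wins, n in [(5, 8), (50, 80), (500, 800)]:
    print(n, binomial_two_sided_p(wins, n))
```

With only 8 comparisons the result cannot be told apart from a coin flip, while with several hundred the same ratio is very unlikely under a tie, so the evaluation set size is the quantity that settles the disagreement.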
Response to Reviewer xewV
ACL ARR 2023 December Paper1047 Authors (Canwen Xu), 25 Jan 2024
Comment: We would like to thank the reviewer for your insightful comments.
"This paper mentions that calling the API makes it expensive to substitute human annotators with pre-aligned models, and the presented approach suffers from the same issue."
These model outputs can be obtained cheaply from saved chat history. Obtaining an inferior model's output is cheaper.
"The improvements over supervised fine-tuning are not satisfactory, 62.5% of win ratio compared to supervised fine-tuning on GPT-4 is close to tie (around 3 of 5)."
The improvement is statistically significant. Also, this level of improvement is often expected in RLHF, DPO, etc.
"The effectiveness with less strong superior LLM is unclear, and it is possible that the effectiveness of the paper relies heavily on the strong GPT-4."
Our finding on curriculum learning contradicts this hypothesis. Using GPT-4 vs InstructGPT at first and gradually transitioning to GPT-3.5-turbo vs InstructGPT is better than using GPT-4 all the time.
"The data curriculum approach is quite simple and the paper lacks comparison with the other curriculum learning method."
Heuristically reordering the examples is a "free lunch" in contrastive learning and our investigation aims to unveil that.

Supplementary Materials by Program Chairs
ACL ARR 2023 December Program Chairs, 16 Dec 2023
Responsible NLP Research: pdf
Note From EiCs: These are the confidential supplementary materials of the submission.
If you see no entries in this comment, none have been submitted.