Meta Review of Paper1770 by Area Chair mboG
ACL ARR 2023 December, 02 Feb 2024
Readers: Paper1770 Senior Area Chairs, Paper1770 Area Chairs, Paper1770 Authors, Paper1770 Reviewers Submitted, Program Chairs

Metareview: This paper aims to study alignment through a causal lens. What it boils down to is a simple tweak to PPO: the loss for each training example is weighted by the reward delta between the outputs before and after the intervention. Experiments on a couple of tasks and pretrained models show that the proposed method outperforms baselines. Overall, I believe it is ready for *ACL conferences with minor revisions.

Summary Of Reasons To Publish: Looking at RLHF through a causal perspective is novel and interesting. As all reviewers agree, the paper does a great job walking readers through causality, a relatively less popular topic in our field. Most reviewers find the results convincing and believe that the proposed method can be very useful.

Summary Of Suggested Revisions: The authors did a great job addressing the reviewers' concerns. One obvious suggestion is to include the new results in the revision. The paper can also be improved by expanding its discussion of related work, which is already happening in the discussion with some of the reviewers.

Overall Assessment: 4 = There are minor points that may be revised
Best Paper AE: No
Ethical Concerns: None
Needs Ethics Review: No
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
Great Reviews: HqTZ

A Kind Reminder for Response Discussion
Paper1770 Authors (Yu Xia, privately revealed to you), 28 Jan 2024, Official Comment

Comment: Dear Reviewers, Thank you once again for your valuable feedback and insightful suggestions. As the discussion period is nearing its end on January 29th, we kindly ask if you have any further questions regarding our responses or any additional queries that we could address.

Official Review of Paper1770 by Reviewer HqTZ
19 Jan 2024 (modified: 29 Jan 2024), Official Review

Paper Summary: The paper tries to take the source of bias into consideration (specifically, training data and input prompt) when aligning LLMs so that the outputs are less biased. To do alignment, the authors adaptively weight biased examples during RL. The weight is essentially based on Equation (1), which is the difference between the rewards of LLM outputs before intervention (using theta_init as the LLM parameters) and the rewards of LLM outputs after intervention (using some other theta as the LLM parameters). The authors' approach is named causality-aware alignment (CAA). The authors experiment on two tasks (text continuation using IMDB reviews and RealToxicityPrompts, and summarization using the extractive CNN/DM dataset and the abstractive XSum dataset). The base models are GPT-2 (for text continuation) and T5 (for summarization), and the baselines are SFT (details seem to be omitted from the paper), PPO, and the base models themselves. All reward models used are on HuggingFace (although no link is provided).
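The per-sample weight described above can be sketched in a few lines. This is a hypothetical illustration of Equation (1) as summarized in this review, assuming the reward is a scalar per (prompt, output) pair; the names `caa_weight` and `toy_reward` are illustrative, not from the paper.

```python
def caa_weight(reward_fn, prompt, output_after, output_before):
    """Reward delta between the post-intervention (theta) output and the
    pre-intervention (theta_init) output for the same prompt."""
    return abs(reward_fn(prompt, output_after) - reward_fn(prompt, output_before))

def toy_reward(prompt, output):
    # Toy stand-in for a HuggingFace reward model: penalize the token "toxic".
    tokens = output.split()
    return -sum(t == "toxic" for t in tokens) / max(len(tokens), 1)

# The resulting scalar weights this example's contribution to the RL loss.
w = caa_weight(toy_reward, "a prompt", "a clean reply", "a toxic reply")
```

Examples where the intervention changed the reward a lot get a large weight, while examples the intervention left unchanged contribute little.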
The authors find that CAA outperforms PPO on the respective metrics.

Summary Of Strengths: Looking at biases through a causality perspective hasn't been investigated enough in the literature, and it's great the authors are pursuing this path. I appreciate the authors for working on better learning objectives in general. The introduction to causality is helpful, and the figures are helpful. Results seem promising at first glance, assuming that training has converged.

Summary Of Weaknesses: Relatively small concern: base models are quite small (GPT-2, T5). The number of tuning steps is quite small in my opinion, and it's unclear what the trends are after training for longer. This is one of my biggest concerns. More tuning (e.g., learning rate) would also be helpful to know whether CAA is actually better than PPO. More ablations would be great (esp. with different coefficients for the KL penalty). More discussion in related work would be great: it would be great if the authors explicitly mentioned the difference between this approach and that of Zhou et al. (2023), which the authors cite; the tasks may be different, but the motivation and approach have quite some similarities with the authors'. It may be worth mentioning other earlier work on causal analysis of social bias (e.g., Vig et al., 2020; https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf). Another line of work that does weight the loss by some reward during alignment: Baheti et al. (2023) https://arxiv.org/pdf/2305.14718.pdf, as well as a baseline in Baheti et al.: GOLD (https://arxiv.org/pdf/2009.07839.pdf). There needs to be more explanation of SFT in the main text; my understanding is that these days SFT may involve sequences with high rewards as well.
Comments, Suggestions And Typos: Line 033: please fix "K et al". Line 076: not immediately clear to me what "between LLMs and model outputs" means. Minor issue (personal opinion): I understand that many notations are related to causality, but I think it may be clearer to readers if the do(...) notation is removed in Section 3.2; just say you perturb the LLM parameters somehow.

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Reply to Reviewer HqTZ
Paper1770 Authors (Yu Xia, privately revealed to you), 26 Jan 2024, Official Comment

Comment: Thank you for your time and effort in reviewing our paper. We provide point-by-point responses as follows:

Response to Weakness 1 (Model Size): Thanks for pointing this out. In our experiments, we follow the setup in [1, 2] to choose the base models and corresponding downstream tasks.
Though with relatively small LLMs, we believe our results are still representative and helpful in validating the effectiveness of our proposed method.

Response to Weakness 2 (Amount of Tuning): Thank you for your valuable suggestion. In our experiments, we report the results after a similar number of RL finetuning steps as in [2], where our proposed CAA has shown clear advantages over PPO. As it is indeed helpful to learn the trend with more training, we report the logged performance of CAA and PPO on the validation set of the RealToxicityPrompts dataset across more finetuning steps below, which further demonstrates the advantage of our method:

| Toxicity-R | @100* | @300* | @500* | @700 | @900 | @1100 |
| PPO | 0.124 | 0.119 | 0.113 | 0.112 | 0.109 | 0.110 |
| CAA | 0.114 | 0.088 | 0.074 | 0.076 | 0.073 | 0.074 |

*Steps 100, 300, and 500 are reported in Figure 4b. Lower Toxicity-R indicates better performance.

Response to Weakness 3 (Ablations): Thank you for your suggestion. We report the performance of CAA and PPO on the test set of the RealToxicityPrompts dataset with KL coefficients of 0 and 0.3 below:

| Model | Toxicity-R ↓ | Perplexity ↓ |
| PPO (kl=0) | 0.034 | 88.92 |
| CAA (kl=0) | 0.025 | 90.23 |
| PPO (kl=0.3) [Reported in Table 1] | 0.083 | 31.34 |
| CAA (kl=0.3) [Reported in Table 1] | 0.049 | 30.91 |

The results show that with smaller KL coefficients, both PPO and CAA achieve better reward scores but worse perplexity scores, as the models diverge too far from the original. This phenomenon is also observed in [1]. We will add some discussion on the choice of KL coefficient in Section 4.4.

Response to Weakness 4 (Related Work): Thank you for your valuable suggestion. While [3] also proposes a causal debiasing framework for pretrained language models, there are three major differences between it and our work: Task-wise: [3] studies the biases of BERT-like language models mainly in text classification tasks, while our work focuses on the biases of LLMs (GPT-2, T5) in text generation tasks.
This difference also makes [3] not directly comparable to our work. Problem-wise: [3] presents a causal debiasing approach that addresses attribute biases using human-crafted bias word lists, while our work employs a target reward model capable of handling more general biases. Method-wise: [3] utilizes counterfactual augmentation to create interventional data distributions during supervised learning, whereas our work leverages the reward model as an instrumental variable providing interventional feedback in reinforcement learning. We will include more detailed discussion of related work in Section 2, as you suggested.

Response to Weakness 5 (SFT Detail): Thank you for your feedback. The SFT baseline in our experiments is standard supervised finetuning on the full training samples without reward-based filtering, following the setup in [2]. We will include this information in Section 4.3.

Response to Comments and Typos: Thank you for your careful reading and helpful suggestions. We will fix them accordingly.

[1] Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. ICLR 2023.
[2] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. NeurIPS 2023.
[3] Fan Zhou, Yuzhou Mao, Liu Yu, Yi Yang, and Ting Zhong. 2023. Causal-Debias: Unifying Debiasing in Pretrained Language Models and Fine-tuning via Causal Invariant Learning. ACL 2023.
Reviewer response
Reviewer HqTZ, 29 Jan 2024, Official Comment

Comment: Thanks for the additional results. They are very helpful; e.g., the "response to weakness 3 (ablations)" table. For "response to weakness 2 (amount of tuning)," I'm quite surprised that for PPO, toxicity improves so slowly. What if you train both PPO and CAA for even longer? I'm also curious whether the quality of the generated texts is good (do they look fluent, sensible, etc., or is there some reward hacking going on)? The response addresses most of my concerns and I'm thinking of raising the score.

Thank you for your reply
Paper1770 Authors (Yu Xia, privately revealed to you), 29 Jan 2024 (modified: 29 Jan 2024), Official Comment

Comment: Thank you again for your valuable feedback and insightful suggestions. We are pleased to know that you found our response helpful and appreciate the raised score. We provide responses to your additional questions below:

I'm quite surprised that for PPO, toxicity improves so slowly. What if you train both PPO and CAA for even longer?

The toxicity improvements of PPO and CAA slow down after 500 RL finetuning steps, and both of their performances become stable thereafter, as shown in the table in our response to weakness 2. The slow improvement of PPO in reducing toxicity has similarly been observed in [2]. We attribute the slow improvement of PPO to the influence of the KL penalty in Equation 2 of our paper. The KL penalty regularizes the model optimization to ensure that the finetuned LLM does not deviate too far from the initial LLM.
In our experiments, we follow the setup in [2] to conservatively choose a KL coefficient of 0.3 for both PPO and CAA, to ensure that the finetuned LLM maintains the fluency and quality of the generated texts, as evaluated in Table 1 of our paper. While setting a lower KL coefficient would indeed lead to faster and further improvements on toxicity, this would trade off the fluency and quality of the generated texts, as shown in the table in our response to weakness 3.

I'm also curious whether the quality of the generated texts is good (do they look fluent, sensible, etc., or is there some reward hacking going on)?

With our choice of KL coefficient as discussed above, the generated texts for both PPO and CAA maintain good fluency (Perplexity) and token diversity (Distinct-1, Distinct-2) in our experiments, as reported in Table 1 of the paper. We can also observe that, with comparable text quality, CAA consistently outperforms PPO on the alignment objective, e.g., reducing toxicity. This further demonstrates the advantages of our proposed method.

[2] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. NeurIPS 2023.

Official Review of Paper1770 by Reviewer fTmp
19 Jan 2024 (modified: 20 Jan 2024), Official Review

Paper Summary: This paper proposes interpreting LLMs and alignment through a causal lens, and exploits this view to derive a new method for LLM alignment (causality-aware alignment; CAA).
Ultimately (unless I misunderstood something), the idea boils down to including per-sample weights on the PPO loss terms, where the weight is high when there's a big difference between the reward assigned to a completion by the aligned LLM and one by the original (SFTed) LLM.

Summary Of Strengths: The results suggest that CAA performs better than SFT and PPO for alignment in terms of preserving fluency and reducing toxicity and biased generation. Given how alignment is a critical part of the LLM training pipeline, this indicates the method could be very useful.

Summary Of Weaknesses: The text could be clearer at times. While it is generally well written, I think the paper could benefit from an expanded discussion of the method and its connection to causal ML. Some methodological choices are not clear to me. For example, it is not clear to me why there should actually be a causal link X -> L. While the argument is that previous work suggests that LLMs can conduct implicit gradient descent, it is not clear whether this actually justifies adding a link in the plate diagram (I would actually have imagined that all the other arrows would be enough). Unless I misunderstood something, it does seem like the method boils down to adding per-sample scaling to the PPO loss. If so, one could argue that the methodological contribution is limited (especially since objectives like DPO can be seen as doing this implicitly).

Comments, Suggestions And Typos: Question 1: Is the main change w.r.t. the PPO objective that we now have per-sample scaling, where the scaling is given by Eq. 1?

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 2.5
Confidence: 2 = Willing to defend my evaluation, but it is fairly likely that I missed some details, didn't understand some central points, or can't be sure about the novelty of the work.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Reply to Reviewer fTmp
Paper1770 Authors (Yu Xia, privately revealed to you), 26 Jan 2024, Official Comment

Comment: Thank you for your time and effort in reviewing our paper. We provide point-by-point responses as follows:

Response to Weakness 1: Thank you for your suggestion. We will check and improve the clarity of our presentation and add more discussion on the connection between our method and causal ML in Section 2.2.

Response to Weakness 2: Thank you for your question. As described in Section 3.1, the causal link X -> L represents the influence of the input prompt on the LLM, which is validated by the in-context learning phenomenon, where an LLM is able to learn new tasks from a few demonstration examples in the prompt. As illustrated in Figure 2(d) and also suggested in [1, 2], different choices of demonstration examples in the input prompt will indeed cause different LLM behaviors, which essentially establishes a causal relationship.

Response to Weakness 3 and Question 1: Thank you for your insightful comments. Inspired by our causal view, our method utilizes interventional feedback from the reward model as a sample weight. This can indeed be considered per-sample scaling of the PPO loss based on causal signals during RL fine-tuning.
However, besides the proposed training technique, another major contribution of this paper is the structural causal model for LLM text generation detailed in Section 3.2. Extending beyond the simple analogy in prior work [3], we revisit the RLHF process from a causal perspective and analyze the role of the reward model as an instrumental variable intervening on the LLM. We believe such a causal understanding is important and beneficial for facilitating more interpretable and accountable LLM alignment and debiasing studies in the future.

[1] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2023. Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers. Findings of ACL 2023.
[2] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Joao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers Learn In-Context by Gradient Descent. ICML 2023.
[3] Cheng Zhang, Stefan Bauer, Paul Bennett, Jianfeng Gao, Wenbo Gong, Agrin Hilmkil, et al. 2023. Understanding Causality with Large Language Models: Feasibility and Opportunities. arXiv preprint arXiv:2304.05524.

Response
Reviewer fTmp, 30 Jan 2024, Official Comment

Comment: Thank you for your response and clarifications. Overall, I think this paper is promising, but I think it could still benefit from another round of polishing and reviewing (especially if one of the main contributions is the causal interpretation of RLHF). For this reason, I am keeping my score unchanged.
Thank you for your reply
Paper1770 Authors (Yu Xia, privately revealed to you), 30 Jan 2024 (modified: 30 Jan 2024), Official Comment

Comment: Thank you again for your valuable feedback and recognition of our paper. To provide details of our causal interpretation of RLHF: in Section 3.1, we revisit LLM text generation from a novel causal view, and inspired by this view, in Section 3.2, we analyze the causal relations in RL alignment methods (e.g., RLHF) in detail, especially the role of the reward model. We will improve our discussion to further highlight our contributions in this regard. Our causal framework is developed to effectively identify and understand the studied problem, and thus guide our methodology design in Section 3.3. A similar strategy (i.e., developing a novel causal view to inspire new algorithm designs) has also recently been recognized as an important contribution in previous work in other NLP domains, such as implicit sentiment analysis [4] and prompt-based probing [5].

[4] Siyin Wang, Jie Zhou, Changzhi Sun, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. 2022. Causal Intervention Improves Implicit Sentiment Analysis. In Proceedings of the 29th International Conference on Computational Linguistics.
[5] Boxi Cao, Hongyu Lin, Xianpei Han, Fangchao Liu, and Le Sun. 2022. Can Prompt Probe Pretrained Language Models? Understanding the Invisible Risks from a Causal View. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
Official Review of Paper1770 by Reviewer PDyp
17 Jan 2024, Official Review

Paper Summary: This paper introduces Causality-Aware Alignment (CAA) as a new method to reduce biases in LLMs using reinforcement learning and interventional feedback. The paper outlines the causal perspective of text generation and the proposed CAA method to align LLMs to produce less biased outputs. While the approach shows promise, the paper would benefit from including empirical results and comparing CAA with existing debiasing methods for LLMs.

Summary Of Strengths: This paper introduces a novel approach, CAA, which leverages reinforcement learning and interventional feedback to address biases in LLMs. CAA tackles the debiasing problem from a causality perspective, which is interesting to me. The method demonstrates some potential for mitigating biases in LLMs and aligning them to generate less biased outputs. The paper articulates the method well, outlining the causal perspective of text generation and the proposed CAA method. The use of reinforcement learning and interventional feedback is well explained. Given the increasing concern about biases/safety in LLMs, the paper's focus is highly relevant and addresses a critical issue in the LLM era.

Summary Of Weaknesses: The paper would benefit from including empirical results and performance evaluations of the CAA method. While the proposed approach is promising, empirical evidence would strengthen the paper's contributions and provide insights into the effectiveness of CAA in mitigating biases in LLMs. The paper could also benefit from a comparative analysis with existing debiasing methods for LLMs.
A comparison with other approaches would provide a clearer understanding of the advantages and limitations of the CAA method relative to existing techniques.

Comments, Suggestions And Typos: The paper would be strengthened by including additional empirical results to demonstrate the effectiveness of the proposed CAA method in reducing biases in large language models. Additionally, a comparative analysis with existing debiasing methods for LLMs would provide valuable insights into the advantages of CAA.

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Reply to Reviewer PDyp
Paper1770 Authors (Yu Xia, privately revealed to you), 26 Jan 2024, Official Comment

Comment: Thank you for your time and effort in reviewing our paper.
We provide point-by-point responses as follows:

Response to Weakness 1 and Comments: Thank you for your valuable suggestion. In Section 5.3, we present empirical studies that demonstrate the effectiveness of our method in capturing cases where non-toxic prompts lead to toxic outputs due to biases. We will include additional empirical results to offer further insights into our method.

Response to Weakness 2 and Comments: Thank you for your suggestion. As mentioned in Section 2.2, a recent language model debiasing work [1] also adopts a causal inference technique. However, [1] relies on human-crafted bias word lists and focuses on addressing those specific attribute biases during fine-tuning. In comparison, our method is more flexible in utilizing a reward model instead of limited bias word lists to address biases in a broader range of tasks. We will include a more detailed discussion of related work in Section 2.2.

[1] Fan Zhou, Yuzhou Mao, Liu Yu, Yi Yang, and Ting Zhong. 2023. Causal-Debias: Unifying Debiasing in Pretrained Language Models and Fine-tuning via Causal Invariant Learning. ACL 2023.

Official Review of Paper1770 by Reviewer EhY9
16 Jan 2024 (modified: 30 Jan 2024), Official Review

Paper Summary: The authors propose a new LLM alignment method called Causality-Aware Alignment (CAA), which essentially adds a weight to the Reinforcement Learning (RL) fine-tuning loss, after the KL divergence penalty, following PPO. The weight is the absolute value of the reward-model reward difference between the initial LLM and the LLM during training. They provide a conceptual explanation as to why the captured difference can be considered a variant of causal invariant learning.
They tested CAA on a text continuation task and a text generation task, with 4 different test sets. They record the performance along 3 reward model results: Senti-R, Toxicity-R, and Bias-R. It's demonstrated that CAA consistently outperforms PPO, while maintaining perplexity and diversity.

Summary Of Strengths: The authors did a very thorough test of the proposed method, across 4 different test sets between text continuation and text summarization tasks. They also trained multiple held-out models that were not seen during RL finetuning to avoid potential reward hacking.

Summary Of Weaknesses: While the conclusion makes sense directionally, I have a few questions. 1) KL divergence captures the token-level difference between the initial LLM and the RL-finetuned LLM. Its purpose is to penalize the finetuned LLM for deviating too far away from the initial LLM. Does that count as "interventional feedback" already? 2) Additionally, do the KL divergence and the introduced "interventional feedback" contradict each other, or are they along orthogonal dimensions? KL divergence penalizes LLM output that is too different from the initial LLM, while interventional feedback weights more when the reward is different. However, KL divergence is at the token level, but the introduced interventional feedback is for the overall reward. 3) There is no ablation study for the previous point. It would be great if we could have another model with only interventional feedback but no KL penalty. This could help answer question #2. 4) Most recent work on RLHF uses datasets like Open Assistant and Anthropic. For bias and toxicity, Open Assistant and Anthropic have the Harmful dataset. Why are we not using those more commonly used datasets? What about Helpfulness and Honesty? Does the model maintain fluency and diversity, but regress on those dimensions? 6) The explanation in Section 5.2 is not intuitive, as the interventional feedback weight is based not on the reward, but on the difference of rewards.
Comments, Suggestions And Typos: The paper would benefit from an ablation study and more discussion around the KL penalty within PPO and the introduced "interventional feedback" weights. If we could have another model with only interventional feedback but no KL penalty, it would address questions #1 and #2 from above. The paper would also benefit from using datasets more commonly used in RLHF papers, including Open Assistant and Anthropic; a clearer discussion of Section 5.2 (the weight is based on the reward difference, not the reward, so why is it correlated with prompt toxicity?); and fixing the arrows in Figure 3 (r_int and r should both be related to LLM_init and LLM, similar to the KL penalty).

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Limitations And Societal Impact: The authors discussed that the reward model should ideally not be correlated with the input prompts or training data. This is counter-intuitive. Also, does the proposed method only work for debiasing? What about general alignment dimensions like honesty and helpfulness?
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 3 = From the contents of the submission itself, I know/can guess at least one author's name.
Reply to Reviewer EhY9
Paper1770 Authors (Yu Xia, privately revealed to you), 26 Jan 2024, Official Comment

Comment: Thank you for your time and effort in reviewing our paper. We provide point-by-point responses as follows:

Response to Question 1 and Question 2: Thank you for your insightful questions. As described in Section 3.2, the interventional feedback introduced in our method aims to intentionally influence the LLM finetuning towards our alignment objective. The KL penalty, on the other hand, imposes a constraint to ensure that the finetuned LLM retains most of the abilities obtained during the pretraining stage, which can indeed be considered a form of general interventional feedback. However, a key difference between them is that the KL penalty is uninformed of any specific alignment goal, while our proposed interventional feedback provides information directed towards the given alignment objectives. Your observation about their orthogonal dimensions is correct. The KL penalty measures the difference at the logit level, via token probabilities, while our interventional feedback measures the difference at the goal level, via reward signals. To account for these differences, in the objective function we add the KL penalty to the original reward while utilizing our interventional feedback as sample weights, as discussed in Line 329.

Response to Question 3 and Comments: Thank you for your valuable suggestion.
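As an illustrative aside, the division of labor the authors describe (KL penalty folded into the reward; interventional feedback applied as a per-sample weight on the loss) can be sketched as follows. The function names, the use of a mean token-level KL, and all values are assumptions for illustration, not the paper's implementation.

```python
def shaped_reward(task_reward, mean_token_kl, kl_coef=0.3):
    # The KL penalty is subtracted from the task reward, as is standard in
    # PPO-based RLHF: a token-probability constraint, blind to the alignment goal.
    return task_reward - kl_coef * mean_token_kl

def weighted_sample_loss(ppo_loss, interventional_weight):
    # The interventional-feedback weight scales the whole sample's PPO loss:
    # a goal-level signal derived from the reward delta.
    return interventional_weight * ppo_loss

r = shaped_reward(task_reward=0.8, mean_token_kl=0.5)            # 0.8 - 0.3 * 0.5
loss = weighted_sample_loss(ppo_loss=2.0, interventional_weight=0.25)
```

The two mechanisms act at different points of the objective, which is consistent with the authors' claim that they operate along orthogonal dimensions.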
We report the results of PPO and CAA without the KL penalty on the test set of the RealToxicityPrompts dataset below:

Model                                | Toxicity-R ↓ | Perplexity ↓
PPO (kl=0)                           | 0.034        | 88.92
CAA (kl=0)                           | 0.025        | 90.23
PPO (kl=0.3) [Reported in Table 1]   | 0.083        | 31.34
CAA (kl=0.3) [Reported in Table 1]   | 0.049        | 30.91

From the results, we observe that without the KL penalty, LLMs quickly diverge towards achieving better toxicity scores but sacrifice their language modeling abilities, leading to high perplexity. Such performance degradation is also observed in [1], which highlights the importance of the KL penalty in PPO fine-tuning. For both kl=0 and kl=0.3, as our method encourages larger reward differences, it achieves a better toxicity score with a comparable perplexity score relative to PPO. This shows that our interventional feedback and the KL penalty operate along different dimensions. Response to Question 4/5 and Comments: Thank you for your comments. In our experiments, we follow [1, 2] in selecting all datasets and downstream tasks, and we report a variety of metrics covering different aspects, including fluency and diversity. While your suggestion to evaluate our method on additional datasets is indeed valuable, we believe that our current results are representative and sufficient to validate the effectiveness of our method. Response to Question 6 and Comments: Thank you for your valuable feedback. In Section 5.2, we show that while there is a general trend that toxic prompts lead to toxic outputs, there are also scenarios where non-toxic prompts lead to toxic outputs because of biases. Our results in Figure 5 are intended to demonstrate that our intervention weights can effectively handle such scenarios. We will include a more detailed explanation in Section 5.2 to improve clarity. Responses to Limitations: Thank you for your feedback. For the reward model, our intention is to convey that, ideally, it should not inherit biases from its training data.
For instance, if a reward model is trained predominantly on non-toxic samples, its ability to distinguish between toxic and non-toxic content may be compromised, thereby affecting the effectiveness of subsequent RL fine-tuning. We will clarify this point in Section 3.2. Our proposed method is also applicable to general alignment objectives with appropriate reward signals. Given that the motivation for interventional feedback originates from a causal debiasing perspective, we have chosen to focus on debiasing tasks in this paper. Further exploration of other alignment dimensions is planned for future work. [1] Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. ICLR 2023. [2] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. NeurIPS 2023. Reply ACL ARR 2023 December Paper1770 Reviewer EhY9 29 Jan 2024 ACL ARR 2023 December Paper1770 Official Comment Readers: Program Chairs, Paper1770 Senior Area Chairs, Paper1770 Area Chairs, Paper1770 Reviewers Submitted, Paper1770 Authors Comment: Thank you for your reply, and I appreciate the additional data points you provided for the ablation study. The authors addressed most of my questions, and I have decided to raise my scores.
Author-Editors Confidential Comment by Paper1770 Authors • Short Note on Reviewer EhY9's Scores Update. Thank you for your reply ACL ARR 2023 December Paper1770 Authors Yu Xia (privately revealed to you) 30 Jan 2024 (modified: 30 Jan 2024) ACL ARR 2023 December Paper1770 Official Comment Readers: Program Chairs, Paper1770 Senior Area Chairs, Paper1770 Area Chairs, Paper1770 Reviewers Submitted, Paper1770 Authors Comment: Thank you again for your valuable feedback and insightful suggestions. We are glad to know that you found our response helpful and sincerely appreciate your decision to raise the scores. As the discussion period is drawing to a close, we wonder if you could kindly reflect the updated scores in your original review. We are more than happy to assist should you have any further questions. Reply ACL ARR 2023 December Paper1770 Reviewer EhY9 30 Jan 2024 ACL ARR 2023 December Paper1770 Official Comment Readers: Program Chairs, Paper1770 Senior Area Chairs, Paper1770 Area Chairs, Paper1770 Reviewers Submitted, Paper1770 Authors Comment: Done! I updated my score to reflect this. Thank you! Supplementary Materials by Program Chairs ACL ARR 2023 December Program Chairs 16 Dec 2023 ACL ARR 2023 December Paper1770 Supplementary Materials Readers: Program Chairs, Paper1770 Reviewers, Paper1770 Authors, Paper1770 Area Chairs, Paper1770 Senior Area Chairs Responsible NLP Research: pdf Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, none have been submitted.