Decision: Accept-Main

Comment: The authors propose InterFair, a novel human-in-the-loop pipeline for model debiasing in which free-text feedback from humans is used to update importance scores for model rationales. The free-text feedback is parsed by a text-davinci-003 model and transformed into numeric scores, which are then used to update the bias importance of each individual token for a frozen model on the BiosBias classification task. Following the claim that it is challenging to eliminate biases fairly in a purely algorithmic way, the authors evaluate their approach at inference time in two experiments where users interact with test samples and inputs, showing that both scenarios further mitigate bias in the explanations. The reviewers agree that the paper is clearly motivated (beK8, KjGw, mchw) and that it is well written and intuitively structured (mchw). Reviewers note that the proposed methodology is novel (beK8) and has a lot of potential (mchw), and that the presented experimental results support the authors' approach (beK8, KjGw). The main criticisms raised in the reviews are the lack of details regarding the human study (beK8) and the need for more clarity with respect to the gap to previous work (beK8, KjGw) as well as the experimental setup (mchw). Furthermore, the choice of an LSTM-based model rather than a Transformer-based variant as part of the pipeline limits the scope of the work (KjGw). Lastly, a question was raised with respect to the choice of natural language feedback instead of plain token editing (beK8). The authors responded to the questions raised in the reviews and committed to adding clarifying details. They also presented additional results with a BERT-based model, motivated their choice of free-text input through a user study, and clarified the gap with respect to previous work. Having read the paper, the reviews, and the discussion, it is my opinion that the paper is for the most part well written and easy to follow, and that it tackles a relevant problem in a novel manner. My main concern is a significant reliance on the work of He et al. (2022), where the reader is assumed to be familiar with that setup; this can be improved in the manuscript. Another concern was the limited scope of the experiments, which the authors addressed during the discussion period by adding experimental results on a Transformer-based model.

--

Paper Topic And Main Contributions: Firstly, the paper proposes a new pipeline (InterFair) to collect human feedback about bias rationales and to incorporate that feedback into how the model produces future task rationales and bias rationales. Secondly, the paper presents an experiment and a human study to test the proposed pipeline.

Reasons To Accept:
- Clear motivation for why human feedback matters
- Novel methodology
- Results show that the proposed method outperforms the no-feedback baseline

Reasons To Reject: I find it unclear why users should provide natural language feedback to be parsed into a High/Low/NA importance score per token if the goal is just to adjust the weights of the already highlighted bias rationales.
Since changing the weights of a few tokens seems easier and faster than writing a free-text response (to be later parsed by a GPT-3 LLM), why don't the authors keep only the option "To directly modify the bias rationales, users can increase or decrease the bias importance score for each token accordingly."? Without more detailed motivation, the choice of natural language feedback may be an unnecessary hurdle for the experiment. Details of the human study are also lacking (e.g., platform, subjects' background in NLP/ML bias, the distribution of feedback styles: natural language form vs. importance-weight adjustment form).

Questions For The Authors: Question A: Could you please provide some potential explanations for why gradient methods outperform heuristic methods (Table 2)?

Soundness: 4: Strong: This study provides sufficient support for all of its claims/arguments.
Excitement: 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., it describes incremental work), and it can significantly benefit from another round of revision. However, I won't object to accepting it if my co-reviewers champion it.
Reproducibility: 3: Could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined; the training/evaluation data are not widely available.
Ethical Concerns: No
Reviewer Confidence: 3: Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math, experimental design, or novelty.

--

Paper Topic And Main Contributions: This work proposes an approach to achieving a better trade-off between model debiasing and performance via user interaction on task and bias rationales. Experimental results show that the proposed methods better mitigate model bias while achieving better task performance.

Reasons To Accept: Clear motivation, with a good showcase of how this work improves the trade-off between model performance and bias.

Reasons To Reject: The main experiments are conducted with an LSTM, so we have no idea how the proposed methods would perform on LLMs, which in my view are more worthy of studying. Meanwhile, this work is quite similar to He et al. 2022, but the authors do not clearly explain which major shortcoming of He et al. 2022 this work handles better.

Questions For The Authors: In line 121, the authors mention that "the opaqueness of these models hinders faithful perturbation of rationales"; I wonder why LLMs hinder the perturbation of rationales?

Soundness: 4: Strong: This study provides sufficient support for all of its claims/arguments.
Excitement: 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., it describes incremental work), and it can significantly benefit from another round of revision. However, I won't object to accepting it if my co-reviewers champion it.
Reproducibility: 4: Could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Ethical Concerns: No
Reviewer Confidence: 4: Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
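For concreteness, the feedback-parsing step that the metareview and the first review discuss (free-text feedback mapped to High/Low/NA bias-importance labels per token and then to numeric scores) might look roughly like the sketch below. This is an illustrative reconstruction based only on the review text: the prompt wording, the few-shot example, the label-to-score mapping, and the legacy `openai.Completion` call are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): parse free-text user feedback into
# per-token High/Low/NA bias-importance labels with a few-shot text-davinci-003
# prompt, then map the labels to numeric score adjustments.
import json
import openai  # legacy (<1.0) openai SDK assumed, since text-davinci-003 is a Completions model

FEW_SHOT = (
    'Tokens: ["She", "is", "a", "talented", "surgeon"]\n'
    'Feedback: The pronoun should not matter for predicting the occupation.\n'
    'Labels: {"She": "High", "is": "NA", "a": "NA", "talented": "Low", "surgeon": "Low"}\n\n'
)

LABEL_TO_SCORE = {"High": 1.0, "Low": 0.0, "NA": None}  # placeholder mapping

def parse_feedback(tokens, feedback):
    """Return a token -> bias-importance score dict (None = leave the score unchanged)."""
    prompt = (
        FEW_SHOT
        + f"Tokens: {json.dumps(tokens)}\n"
        + f"Feedback: {feedback}\n"
        + "Labels:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=256
    )
    labels = json.loads(resp["choices"][0]["text"].strip())
    return {tok: LABEL_TO_SCORE.get(labels.get(tok, "NA")) for tok in tokens}
```

Under this reading, the first review's question amounts to asking whether this LLM parsing step adds anything over letting users set the per-token scores directly.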
--

Paper Topic And Main Contributions: This paper introduces the InterFair framework for iteratively incorporating user feedback to mitigate the model's reliance on biased features of the input. InterFair first parses the open-ended user feedback into per-token bias weights with a large language model (GPT-3) used in a few-shot setup. Second, this feedback is used to update the model rationales by differentiating through the model's generated rationales. Results show that users' interactions with model rationales through InterFair can improve the comprehensiveness and sufficiency of the model's rationales without compromising prediction performance on named entity recognition tasks.

Reasons To Accept:
- The paper is well written and intuitively structured
- The motivation for the method's design is clear and well described
- Incorporating human feedback into models' decision processes is a promising, though challenging, research direction

Reasons To Reject: My main concern is the underspecification of multiple details throughout the work that are needed to understand the merit of the contribution. Below I list the more important ones and follow with presumably easier-to-address points in the "Presentation Improvements" section.
- The Background section does not contain an overview of previous methods addressing debiasing, and specifically debiasing through rationales. This makes it difficult to assess the relatedness and novelty of the proposed approach. Subsequently, the important motivating point that (L92) "even an algorithmically debiased model can have failure modes" probably refers to these previous methods, which, however, remain unknown to the reader at this point. The authors try to bridge this gap in the brief description of their baselines in the Results section, L256-261 (denoted Rerank and Adv), but I have trouble grasping how these baselines work, let alone how they relate to InterFair or whether they are sufficiently competitive.
- The evaluation, including the baselines, is difficult to interpret. The reader is expected to understand the main metrics of comprehensiveness and sufficiency only from the cited work (L301). After following this pointer and checking the ERASER benchmark paper (https://aclanthology.org/2020.acl-main.408.pdf), I am convinced of the selection of these metrics for the evaluation, but I believe a thorough description of them would be in place here.
- I think the current description of the mechanism for incorporating feedback (Sec. 3.3) leaves a lot of room for ambiguity. After multiple reads, I do not feel confident in either the Heuristic or the Gradient approach description. My questions address this, but the method description, as the main proposition of the paper, should be much more specific.
- The missing code submission and the lack of appendices with details currently make it hardly possible to reproduce, and possibly use, the method.

Questions For The Authors:
A: In the Heuristic approach (L190), how are the new task rationales used to generate the new prediction?
B: In the Gradient approach, you state that you do not update the model but only the final hidden states (h) of the LSTM. My question is: how can you differentiate between per-token hidden states of dimensionality (hidden_state_dim, prediction_length) and rationales of (prediction_length, )? Am I missing something here?
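Regarding Question B above, one plausible reading of the Gradient approach, offered here as a hedged sketch rather than the authors' actual mechanism, is that the frozen model's final hidden states are treated as free parameters and updated at inference time against a loss built from the parsed per-token feedback, after which the rationale scores and the prediction are recomputed from the updated states. All module and variable names below are illustrative assumptions.

```python
# Illustrative sketch of an inference-time gradient update that keeps the model
# weights frozen and only perturbs the final hidden states h.
# `rationale_head` is assumed to map (seq_len, hidden_dim) states to per-token
# bias-importance scores; neither it nor the MSE loss is taken from the paper.
import torch

def refine_hidden_states(h, rationale_head, feedback_scores, steps=10, lr=0.1):
    """h: (seq_len, hidden_dim) final LSTM states; feedback_scores: (seq_len,)
    target bias-importance values parsed from user feedback (NaN = no feedback)."""
    h = h.detach().clone().requires_grad_(True)   # the model stays frozen; only h moves
    target = torch.as_tensor(feedback_scores, dtype=torch.float)
    mask = ~torch.isnan(target)                   # only tokens the user commented on
    opt = torch.optim.SGD([h], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        bias_scores = rationale_head(h).squeeze(-1)   # (seq_len,)
        loss = torch.nn.functional.mse_loss(bias_scores[mask], target[mask])
        loss.backward()
        opt.step()
    return h.detach()  # reused for the task prediction and the new rationales
```

Under this reading there is no dimensionality clash: the loss is a scalar computed from the (seq_len,) rationale scores, and autograd propagates it back to the (seq_len, hidden_dim) hidden states.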
Typos Grammar Style And Presentation Improvements:
- I do not understand the description of the "compositional split" on L238: "compositional split where the gold parse has three or more contiguous token sequences". Subsequently, I cannot relate to the conclusion on L240 that "the compositional split is harder than the IID due to its complexity".
- I believe the part on L084 deserves further justification: "He et al. (2022) argue that weighing less on high bias and high-task important tokens and promoting their low-bias replacements can simultaneously address both of the failure modes." How can weighting high-task-important tokens less address model failures?
- L202: It seems arguable that the hidden states (h) are not part of the model, since you also state that they are used in prediction.
- L256: perhaps missing "(1)"?

Soundness: 3: Good: This study provides sufficient support for its major claims/arguments; some minor points may need extra support or details.
Excitement: 4: Strong: This paper deepens the understanding of some phenomenon or lowers the barriers to an existing research direction.
Reproducibility: 1: Could not reproduce the results here no matter how hard they tried.
Ethical Concerns: No
Reviewer Confidence: 2: Willing to defend my evaluation, but it is fairly likely that I missed some details, didn't understand some central points, or can't be sure about the novelty of the work.
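The review above notes that a thorough description of the comprehensiveness and sufficiency metrics would be in place. For reference, here is a compact restatement of their standard definitions from the ERASER benchmark (DeYoung et al., 2020) as a small sketch; the `predict_proba` interface is a hypothetical placeholder, not the paper's code.

```python
# Sketch of the ERASER rationale faithfulness metrics (DeYoung et al., 2020).
# `predict_proba(tokens) -> dict[label, probability]` is a hypothetical model interface.

def comprehensiveness(predict_proba, tokens, rationale_mask, label):
    """Drop in the predicted probability of `label` when the rationale tokens are removed.
    Higher values mean the rationale tokens really drove the prediction."""
    full = predict_proba(tokens)[label]
    without_rationale = predict_proba(
        [t for t, keep in zip(tokens, rationale_mask) if not keep]
    )[label]
    return full - without_rationale

def sufficiency(predict_proba, tokens, rationale_mask, label):
    """Drop in the predicted probability of `label` when only the rationale tokens are kept.
    Lower values mean the rationale alone is (nearly) sufficient for the prediction."""
    full = predict_proba(tokens)[label]
    only_rationale = predict_proba(
        [t for t, keep in zip(tokens, rationale_mask) if keep]
    )[label]
    return full - only_rationale
```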