1AC review

Expertise: Passing Knowledge
Originality: Medium originality
Significance: High significance
Rigor: High rigor

1AC: Recommendation

We recommend Revise and Resubmit.

1AC: The Meta-Review

Dear authors, thank you for submitting your paper. Overall, the reviewers found this paper novel and a contribution to the HCI community. We would like you to Revise and Resubmit the paper (R&R) according to the following recommendations:

- Elaborate on and explain the SVM prediction performance (R1)
- Detail the "randomized sequence" and discuss the need for a baseline comparison (R1)
- Rewrite the algorithm part to make it more precise and more intuitive (R2)
- Review and explain the data in Figure 1b (R2, R3)
- Explain the results in Figure 4a (R3)
- Include the technical details in the appendix (R2)
- Justify the choice of ground-truth answers (R2)
- Include and discuss the limitations with respect to complementary team performance (R3)
- Include the details of the product review evaluation (R3)

1AC: The Summary of Revisions Required

Follow the main revisions listed in the meta-review above to improve the paper.

----------------------------------------------------------------

2AC review

Expertise: Passing Knowledge
Originality: High originality
Significance: High significance
Rigor: Medium rigor

Recommendation

I can go with either Accept with Minor Revisions or Revise and Resubmit.

Review

The submission presents an evaluation and mitigation of anchoring bias in sequential decision making, in the context of algorithmic mediation. The setup is novel and important, and therefore I recommend accepting.

Strengths:

1. Anchoring is a well-known effect in human decision making. Anchoring bias in sequential decision making is less discussed in the context of algorithmic fairness, but it is important to consider.
2. The methodology is well motivated and well defined.

The List of Revisions Required

1. I am a bit surprised that the SVM is able to predict college admissions with 98% accuracy but product reviews with only 77%. Is this expected? Could you elaborate on the performance here?
2. I am not an ML expert myself, so this may be a naive question: is there such a thing as a baseline based on a randomized sequence? If so, how do the mitigation strategies reported in the write-up compare to this baseline? Can you elaborate on why randomization would or wouldn't work?
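To make question 2 a bit more concrete, here is a minimal sketch of the kind of baseline I have in mind: present the same instances in random order several times and average the resulting scores. Everything in it (the `run_session` callable, the scoring functions, the numbers) is a hypothetical stand-in, not the paper's setup.

```python
import random

def randomized_order_baseline(instances, run_session, score_fns, n_runs=20, seed=0):
    """Average the given scores over several random presentation orders of the
    same instances (a sketch of what a randomized-sequence baseline could be)."""
    rng = random.Random(seed)
    totals = [0.0] * len(score_fns)
    for _ in range(n_runs):
        order = list(instances)
        rng.shuffle(order)                 # random presentation order
        decisions = run_session(order)     # stands in for a human review session
        for i, fn in enumerate(score_fns):
            totals[i] += fn(decisions, order)
    return [t / n_runs for t in totals]

# Toy usage with stand-in callables (not the paper's tasks or metrics):
if __name__ == "__main__":
    instances = list(range(30))
    fake_session = lambda order: [x % 2 for x in order]                        # pretend decisions
    accuracy = lambda d, o: sum(di == (oi % 2) for di, oi in zip(d, o)) / len(d)
    bias = lambda d, o: abs(sum(d) / len(d) - 0.5)
    print(randomized_order_baseline(instances, fake_session, [accuracy, bias]))
```

The ordering strategies reported in the paper could then be compared against such averages rather than only against the original order.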
----------------------------------------------------------------

reviewer 2 review

Expertise: Knowledgeable
Originality: High originality
Significance: Medium significance
Rigor: Medium rigor

Recommendation

I can go with either Revise and Resubmit or Reject.

Review

Summary

The paper explores anchoring bias in sequential decision tasks. The authors propose two mitigation methods, which balance anchoring bias either after the fact or in real time. Through experiments on committee admission decisions and Amazon product preferences, the authors show that they can help people make less biased decisions with respect to ground-truth answers.

I like several aspects of this paper:

1. I really like the research direction; balancing anchoring bias through reordering seems compelling.
2. I appreciate the large-scale study the authors conducted, especially their effort to work with real-world admission data.

However, I do have some questions:

1. I wish the authors could provide more intuition behind the algorithms for the CHI audience. I consider myself to have a reasonable amount of knowledge and research experience in AI-related fields, but I had a hard time following Section 3.

1.1 For example, I was confused by Figure 1b: (a) what is "avg SVM confidence" exactly? I'm assuming it's the P(reviewer decision) from the SVM? (b) How can "average SVM confidence" be lower than 0? (c) When the authors say "exponentially decay", did they actually fit the data to a mathematical model, or was it just estimated from the shape of the line? (d) What exactly does Equation 2 mean? (A small sketch of the two readings of "confidence" behind (a) and (b) appears at the end of this review.)

1.2 Then, for the LSTM + RL method, how exactly are the hidden states used? Why does the LSTM have to simulate the decisions \hat{d}_{t_0}, ..., \hat{d}_{t_{i-1}}, rather than just \hat{d}_{t_i}?

1.3 I would also suggest replacing the technical details of the AC reinforcement learning and DQN approaches (e.g., the ReLU function, loss functions, L1 loss) with higher-level intuitions and rationales for why the methods are reasonable. The technical details can go in an appendix if they are unique in certain ways, or simply be removed if they are defaults in the field.

2. I wish the authors could discuss the experiment setting in more depth. I wonder whether the chosen ground truths are reasonable:

2.1 As the authors have already noticed, in the admission task "there is a dependency between the decisions made and the assumed ground truth". It is also unclear to me how "unbiased" those final decisions were in the first place; especially because we all know admissions can be somewhat random, I wouldn't be surprised if the results were swayed simply by repeating the review without any reordering.

2.2 For the product review task, I'm not sure the original review rating is a good indicator, as "I'd (NOT) like to read the book" can be affected by many factors unrelated to book quality: if I'm generally not interested in the book's genre, I wouldn't want to read it even if the review is glowing. Without eliminating these factors, I can't decide how to interpret the results.

Especially because it's unclear how "gold" the ground-truth answers are, I feel the paper may benefit from a fairly major revision and be resubmitted to another conference.

The List of Revisions Required

1. Rewrite the algorithm part to make it clearer and more intuitive.
2. Justify the choice of ground-truth answers, or change the experiment design.
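Following up on question 1.1: a minimal sketch, on synthetic data with scikit-learn, of the two common readings of SVM "confidence" and why one of them can go below zero. This is only an illustration of my question, not a reconstruction of the authors' pipeline.

```python
# Illustrative only: the two usual SVM "confidence" readings behave very differently.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

margins = clf.decision_function(X)    # signed distance to the decision boundary
probas = clf.predict_proba(X)[:, 1]   # calibrated probability of the positive class

print(f"signed margin range: {margins.min():.2f} to {margins.max():.2f}")  # negative is normal
print(f"probability range:   {probas.min():.2f} to {probas.max():.2f}")    # stays within [0, 1]
```

If Figure 1b averages signed margins, negative values simply mean the SVM leans toward the negative class for those instances; if it is meant to be P(reviewer decision), values below zero would be surprising. For (c), explicitly fitting an exponential model (e.g., with scipy.optimize.curve_fit) would also make the "exponentially decay" claim checkable.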
----------------------------------------------------------------

reviewer 3 review

Expertise: Knowledgeable
Originality: High originality
Significance: Medium significance
Rigor: Medium rigor

Recommendation

I can go with either Accept with Minor Revisions or Revise and Resubmit.

Review

This paper first proposes a novel algorithm to mitigate anchoring bias by retrospectively adjusting a decision with the decision of an unbiased model. The paper then proposes a second algorithm to determine an optimal ordering of instances to proactively mitigate anchoring bias. These strategies are evaluated on two datasets (college admissions and product reviews) and found to somewhat increase accuracy and reduce anchoring bias.

This paper importantly contributes to the growing body of literature on human-AI decision-making, specifically toward algorithms that mitigate human biases. However, I found parts of the paper to be unclear or unsatisfying:

In Figure 1b, why is the average confidence of an (unbiased) method (i.e., the SVM) decreasing with the number of decisions since the last positive evaluation? Does this not suggest some kind of bias, which shouldn't be the case since the SVM has no notion of the ordering of examples? Moreover, since SVM confidence is measured as the distance to the decision boundary, why does the value become negative at >15 examples? The part of the paper that starts to explain this (Section 3.1) is also not entirely coherent (see the sentence starting "This is indicated by the fact …") and doesn't clarify the situation for me. Could the authors clarify the methodology behind relating the confidence of an unbiased SVM to the anchoring bias of human evaluators?

Bias is used throughout the introduction and method sections but is not operationalized until Section 4.3. Perhaps the paper would benefit from a short and precise description of how exactly the anchoring bias was detected, to serve as a grounding and motivating paragraph before diving into the method in Section 3. Then the first sentence of 3.1 could be modified from citing Figure 1a (which feels flimsy, since its current presentation feels like that of a "teaser figure") to this motivating paragraph. From what I understand, the measure of bias in 4.3 (and then in the evaluation) also does not seem to be the same as the one used to initially motivate the need for the debiasing methods. Why is this? And in Figure 4a, why does the SVM exhibit the same level of bias as the human evaluator?

I'm not sure the discussion of a second RL approach is necessary in this paper. Two strategies were implemented and evaluated; however, the fundamental method of LSTM+RL remains the same, and I feel that presenting only one would simplify the discussion of the method and the system diagram (Fig. 2). As a stretch, it may be worthwhile to instead use this space to explore the effects of another method for mitigating anchoring bias; for instance, as the authors describe in 6.2, I would be interested in seeing how the simple user-experience modification of presenting crowd workers with the quantified anchoring state (as a probability) from the PA method, to give them an indication of when they may be biased, compares to the other methods.

One potential concern I have with the way the proposed methods are evaluated is that the SVM classifier performs (significantly) better than the human evaluators on both datasets. Given the way the PA method aggregates the SVM and human decisions, what would be the expected outcome when the AI agent performs worse than the human agent? Were any experiments considered with complementary team performance, or, at least, do the authors envision any limitations in accuracy or bias improvements from either of the paper's proposed methods under this team dynamic?

Can the authors provide a bit more detail about the product review evaluation? For instance, were crowd workers given completely different review sequences with different product reviews, and/or was there any control over how these examples were distributed? For the methods that were tested with crowd workers (Original, Heuristic, LSTM+DQN, and LSTM+AC), how were the ordered examples from each of these shown to workers? Did workers complete multiple sequences from different algorithms?
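To make the distribution question concrete, a simple counterbalanced assignment could look like the sketch below. Only the four condition names come from the paper; the rotation scheme and worker count are hypothetical, purely to illustrate the kind of control I am asking about.

```python
# Purely illustrative: a Latin-square-style rotation, so that across workers each
# ordering condition appears in every position equally often. If each worker instead
# saw only one condition, the analogue would be random assignment with equal counts.
CONDITIONS = ["Original", "Heuristic", "LSTM+DQN", "LSTM+AC"]  # condition names from the paper

def condition_plan(worker_index, conditions=CONDITIONS):
    """Order in which one (hypothetical) worker would complete the four conditions."""
    k = len(conditions)
    return [conditions[(worker_index + pos) % k] for pos in range(k)]

for w in range(8):  # e.g., 8 hypothetical workers
    print(f"worker {w}: {condition_plan(w)}")
```

Even a short note on whether the study used something like this, fully random assignment, or a between-subjects design would answer the question.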
The paper is overall well written, though there were several grammatically incorrect or incomplete sentences that can be corrected in a revision. The figures and tables were mostly clear and sufficient.

The List of Revisions Required

Please refer to the questions in my review. Clarifications based on the questions or concerns raised would be sufficient.

----------------------------------------------------------------