==================================================
NIPS (Accept)
==================================================

Meta-review

1. Please recommend a decision for this submission.

Accept (Poster)

2. Please provide a meta-review for this submission. Your meta-review should explain your decision to the authors. Your comments should augment the reviews, and explain how the reviews, author response, and discussion were used to arrive at your decision. Dismissing or ignoring a review is not acceptable unless you have a good reason for doing so. If you want to make a decision that is not clearly supported by the reviews, perhaps because the reviewers did not come to a consensus, please justify your decision appropriately, including, but not limited to, reading the submission in depth and writing a detailed meta-review that explains your decision.

Reviews of this paper spanned a broad range, so I also read the paper fully and will incorporate my own viewpoint in addition to those of the three reviewers. The paper examines key qualities of disparate learning processes and demonstrates several potentially surprising properties. In my view, this is an important paper that builds and explicitly states foundational knowledge for the field, and I think that it will serve as an important cornerstone for future work. The points that R2, the lowest-scoring reviewer, raised around a potential lack of novelty or complexity are perhaps not overly damning in this setting -- calling out key qualities of models with fairness constraints, and providing simple proofs and simple empirical confirmation of them, is a selling point to me rather than a detraction from the paper. Indeed, the clarity of the paper was praised by reviewers, and it is this clarity that will help make these points widely understood. Finally, I note that the lowest-scoring reviewer signaled in discussion that they would be okay with accepting this paper. Thus, I am recommending acceptance.

Reviewer #1

Questions

1. Please provide an "overall score" for this submission.

6: Marginally above the acceptance threshold. I tend to vote for accepting this submission, but rejecting it would not be that bad.

2. Please provide a "confidence score" for your assessment of this submission.

2: You are willing to defend your assessment, but it is quite likely that you did not understand central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

3. Please provide detailed comments that explain your "overall score" and "confidence score" for this submission. You should summarize the main ideas of the submission and relate these ideas to previous work at NIPS and in other archival conferences and journals. You should then summarize the strengths and weaknesses of the submission, focusing on each of the following four criteria: quality, clarity, originality, and significance.

This paper tackles a class of algorithms defined as disparate learning processes (DLPs), which use the sensitive feature during training and then make predictions without access to the sensitive feature. DLPs have appeared in multiple prior works, and the authors argue that DLPs do not necessarily guarantee treatment parity, which could then hurt impact parity. The theoretical analysis focuses on relating treatment disparity to utility, and then on optimal decision rules under various conditions. Most notably, per-group thresholding yields optimal rules for reducing the CV gap.
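For reference, the CV gap and the per-group thresholding rule referred to here can be written in standard notation as follows (a sketch consistent with the results the reviewers attribute to Corbett-Davies et al.; it is not copied from the paper's own statement):

    \text{CV gap}(d) = \left| \Pr\big(d(X)=1 \mid Z=a\big) - \Pr\big(d(X)=1 \mid Z=b\big) \right|

    % Per-group thresholding: threshold each group's estimated probability
    % of a positive label at a group-specific cutoff t_z.
    d(x, z) = \mathbb{1}\{\, \hat{p}(Y=1 \mid x, z) \ge t_z \,\}, \qquad z \in \{a, b\}

Choosing the pair (t_a, t_b) to maximize accuracy subject to a bound on the CV gap is the treatment-disparity baseline against which DLPs are compared throughout these reviews.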
As outlined at the beginning of section 4, the theoretical advantages of DLPs seem to be optimality, rational ordering, and "no additional harm" to the protected group. In the experimental section, the authors use a corrupted version of one dataset (CS grad admissions) and five public datasets (many of which are drawn from the UCI dataset collection). For the grad admissions dataset, DLPs violate the within-group ordering and "no additional harm" criteria. Comparing the DLP with a thresholding decision scheme, the authors show that DLPs may cause both accuracy and impact parity (measured by the CV gap) to decrease.

I enjoyed this paper and its discussion of treatment parity and impact parity. In particular, the legal background helps ground the competing definitions of parity, which then extend outward into classes of classifiers. At the same time, parsing out the implied definitions was surprisingly difficult. The paper would be strengthened by formal mathematical definitions of treatment parity and impact parity early in the paper. Without rigorous definitions, it was difficult to align the legal definitions in the introduction with the claims later in the paper. For example, are the criteria listed at the beginning of section 4 included in the definitions of either treatment parity or impact parity -- or are they simply other fairness criteria that the authors chose to consider?

A more minor thought, but the title of the paper asks whether impact disparity requires treatment disparity. This question is not answered in the paper, since the paper only addresses DLPs, specifically how DLPs claim to address both impact and treatment disparity but may in fact fail. As I understand it, if there is another method out there that mitigates impact disparity without treatment disparity, we have not yet disproven its existence.

4. How confident are you that this submission could be reproduced by others, assuming equal access to data and resources?

2: Somewhat confident

Reviewer #2

Questions

1. Please provide an "overall score" for this submission.

4: An okay submission, but not good enough; a reject. I vote for rejecting this submission, although I would not be upset if it were accepted.

2. Please provide a "confidence score" for your assessment of this submission.

4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

3. Please provide detailed comments that explain your "overall score" and "confidence score" for this submission. You should summarize the main ideas of the submission and relate these ideas to previous work at NIPS and in other archival conferences and journals. You should then summarize the strengths and weaknesses of the submission, focusing on each of the following four criteria: quality, clarity, originality, and significance.

This paper provides both a theoretical and empirical analysis of disparate learning processes (DLPs). Disparate learning processes seek to learn "fair" decision rules by taking into account sensitive attributes at training time but producing rules that do not depend on sensitive attributes. Building on existing results, the authors show that for two natural measures of disparity between group outcomes, the optimal way to maximize accuracy subject to a disparity threshold is to set group-specific thresholds.
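As a concrete illustration of what setting group-specific thresholds involves, a minimal brute-force sketch is given below (the function, variable names, and data are assumed for illustration; this is not the authors' code). Any probabilistic classifier, e.g. a logistic regression fit without fairness constraints, can supply the scores; only the final thresholds differ by group.

    import numpy as np

    def per_group_thresholds(scores, y, group, max_gap=0.05, grid=None):
        # Brute-force search over (t_a, t_b) threshold pairs: keep the pair with
        # the highest accuracy among those whose CV gap is at most max_gap.
        if grid is None:
            grid = np.linspace(0.0, 1.0, 101)
        in_a = (group == 0)
        in_b = (group == 1)
        best = None
        for t_a in grid:
            for t_b in grid:
                d = np.where(in_a, scores >= t_a, scores >= t_b)
                cv_gap = abs(d[in_a].mean() - d[in_b].mean())
                if cv_gap > max_gap:
                    continue
                acc = (d == y).mean()
                if best is None or acc > best[0]:
                    best = (acc, cv_gap, t_a, t_b)
        return best  # (accuracy, CV gap, threshold for group a, threshold for group b)

    # Hypothetical usage on synthetic data; in practice the scores would come
    # from a classifier trained on held-out data.
    rng = np.random.default_rng(0)
    group = rng.integers(0, 2, size=2000)
    scores = np.clip(rng.normal(0.45 + 0.10 * group, 0.20), 0.0, 1.0)
    y = (rng.random(2000) < scores).astype(int)
    print(per_group_thresholds(scores, y, group, max_gap=0.05))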
They go on to show experimentally the cost of using DLPs, which cannot use information about the sensitive attribute at decision time, compared to group-specific thresholds. As expected, the group-specific thresholds outperform DLPs. Some experiments are done on data that isn't publicly available, though the authors do show similar results on publicly available datasets. The paper is well-written and easy to understand.

I find the results in this paper to be underwhelming. The theoretical results follow almost directly from Corbett-Davies et al. (reference [10] in the paper), and they imply that DLPs must be suboptimal, which the experiments go on to show. There is an argument to be made for the value of the experiments demonstrating the degree to which DLPs are suboptimal, and this, to me, is the main contribution of the paper; however, this in itself doesn't seem to be particularly compelling.

One interesting direction that the paper touches on is the legal argument for using DLPs. Given that they're clearly suboptimal at reducing the particular forms of disparity being considered here, it's natural to ask why one would consider them at all. Here, the authors claim that DLPs are somehow aligned with the ruling in Ricci v. DeStefano, which set precedent for designing and administering decision-making rules in hiring contexts. This seems to be a crucial point: if DLPs are legal in some contexts in which group-specific thresholds are not, then they could be in some sense Pareto-optimal in those contexts; however, if there are no such contexts, then it doesn't seem particularly relevant to investigate how suboptimal they are. This may be outside the scope of this paper, but it would be interesting to see at least some discussion of or justification for the use of DLPs.

The biggest factor in my review is the lack of novelty in this work -- in my view, most of the theoretical results are easily derived from previous work, and the empirical results are fairly unsurprising.

4. How confident are you that this submission could be reproduced by others, assuming equal access to data and resources?

2: Somewhat confident

Reviewer #3

Questions

1. Please provide an "overall score" for this submission.

7: A good submission; an accept. I vote for accepting this submission, although I would not be upset if it were rejected.

2. Please provide a "confidence score" for your assessment of this submission.

5: You are absolutely certain about your assessment. You are very familiar with the related work.

3. Please provide detailed comments that explain your "overall score" and "confidence score" for this submission. You should summarize the main ideas of the submission and relate these ideas to previous work at NIPS and in other archival conferences and journals. You should then summarize the strengths and weaknesses of the submission, focusing on each of the following four criteria: quality, clarity, originality, and significance.

Summary: This paper describes a series of negative results for "disparate learning processes" (DLPs). These are recent methods to train classification models whose predictions obey a given fairness constraint between protected subgroups (e.g., by solving a convex optimization problem or by preprocessing). The key characteristic of DLPs is that they have access to labels for the protected groups at training time, but not at deployment. In practice, this means that a DLP will produce a single model for the entire population.
In this paper, the authors present several arguments against DLPs, including:

1. If the protected attribute is redundantly encoded in other features, then a DLP may result in the same treatment disparity that it aims to overcome.
2. If the protected attribute is partially encoded in other features, then a DLP may induce within-class discrimination and harm certain members of the protected group.
3. A DLP provides suboptimal accuracy in comparison to other classifiers that satisfy a specific family of fairness constraints.

The authors present several proofs of the theorems for an idealized setting. They complement the theorems with a series of empirical results on a real-world dataset and UCI datasets showing the shortcomings of DLPs and the superiority of simple alternatives (e.g. training a logistic regression model and applying per-group thresholding).

Comments: I thought that this is a well-written paper on an interesting and timely topic. I think that it is important to state that the central contribution of this paper is NOT methodological novelty or theoretical novelty, but rather the results against DLPs. I believe that these results will be significant in the long term, as much recent work in this area has focused on tackling fairness-related issues through DLPs without considering their shortcomings.

I believe that the current paper would be significantly strengthened if the authors were to consider broader approaches to treatment disparity. For example, many readers would consider Dwork et al.'s proposed approach of training "separate models for separate groups" a natural extension of the per-group thresholding used in the numerical experiments. The authors currently cite these methods in the text under "alternative approaches" in Section 2 and "fairness beyond disparate impact" in Section 5. However, they should be discussed more thoroughly and included in the experimental results in Section 4.

Other than this, two other major points to address:

- Among the arguments presented against DLPs, 1. deserves some qualification. It is unlikely that a practitioner would not check whether the protected attribute is redundantly encoded before training the model.
- The results in Section 3 apply to a setting where the authors have access to the optimal classifier d*. This is fine given that the results are in line with their empirical work in Section 4. However, the authors should explicitly state that this does not capture the real-world setting (i.e., where the classifier is learned from data). There should also be a discussion as to when one should expect the results to manifest themselves in the real-world setting.

Specific Comments:

- The main findings (e.g. points 1. through 4) at the end of the Introduction are great. I would recommend adding a short example after each point so that readers from a broader audience will be able to understand the points without having to read the papers.
- Proofs in Section 3 could be put into an Appendix for the sake of space.
- The "Complicating matters…" passage (line 29) in the Introduction is interesting and merits further discussion. I'd suggest placing some of the discussion from lines 318 – 326 here.
- "Alternative Approach" in Section 2 and "Fairness beyond Disparate Impact" in Section 5 should be merged (see also comments above).
- Section 2: the 0-1 loss should be defined in line 70. I'd recommend changing the section title to Preliminaries.

Minor Comments:

- Table 2: the formatting can be improved. The column labels should be in a different
- Line 6 "\textit{dis}parity" <- This seems like a typo at first. I'd recommend removing it, since it just adds a play on words.
- Line 103 "in the next section (\S 4)" <- In Section 4
- Notation: for clarity, I'd recommend using $h$ for the model and \hat{y} for the predicted value.
- Line 298 "data sets" <- "datasets"
- Line 299 "separating estimation and decision-making" <- "separating estimation from decision-making"
- Line 310 "In a recent" <- "In recent"

Post Rebuttal
==========

I have raised my score, as the authors have agreed to address some of the issues raised in my original review. The current submission does not fit the mold of a traditional NIPS paper. However, it provides a crisp and thoughtful discussion of the technical limitations of DLPs. I believe that this makes for an important contribution, especially given that DLPs have become a standard approach to handling issues related to fairness in ML (see e.g. Agarwal et al. and Kilbertus et al. from ICML 2018). In light of this, I'd encourage the authors to make their work accessible to as broad an audience as possible. Aside from the issues raised in my original review, I think one other issue that could be addressed (and was brought up by other reviewers) is to answer the question they ask in the title in a more direct manner (or, alternatively, to change the title).

4. How confident are you that this submission could be reproduced by others, assuming equal access to data and resources?

3: Very confident

==================================================
KDD (Reject)
==================================================

Enclosed are the review comments for your paper:

1146: Does Mitigating ML's impact disparity require treatment disparity?

Hope the review comments can be helpful!

Best regards,
Chih-Jen Lin and Hui Xiong
KDD 2018 Research Track PC Co-Chairs

----------------------- REVIEW 1 ---------------------

PAPER: 1146
TITLE: Does Mitigating ML's impact disparity require treatment disparity?
AUTHORS: Zachary Lipton, Alexandra Chouldechova and Julian McAuley

Novelty: 4 (Some novelty)
Quality: 4 (Good)
Presentation: 1 (No)
Overall evaluation: -2 (I am not championing but if there is a champion then I am fine accepting)

----------- Strengths -----------

The problem is interesting, important and timely.
The proposed solution is convincing and the examples compelling.
The discussion seems well argued.

----------- Weaknesses -----------

The terminology in general, and the distinction between the legal and the technical concepts in particular, is somewhat difficult to grasp.

----------- Review -----------

The problem considered here is fairness in ML-supported decision-making systems. As I understand it, the main argument of the authors is that it is fairer to use sensitive features and apply well-designed disparate treatment than to exclude the sensitive feature at prediction time and reconstruct it from the other attributes, albeit in a partially incorrect way, which results in unintended harm. Besides, it is also more transparent. Overall, the paper is nice to read and appears to make a fairly compelling argument in favor of the use of thresholds, which are a technically simpler solution, actually achieve a better balance between accuracy and fairness in this case, and force one to be more open about the disparate treatment that is being applied. The difference between the legal and technical concepts and the corresponding terms should be explained more carefully.
It might help make the problem more concrete to introduce the example with hiring and hair length earlier. The corresponding figure appears early in the paper, but the explanations are only provided much later. The rather pompous style and somewhat confusing discussion might prevent a reader who is not sufficiently familiar with this problem from noticing some weaknesses in the arguments.

p.1: "In this paper" -> no need for bold

p.1: "referred to as disparate impact and disparate treatment": it seems the stated criteria require NON-disparate impact/treatment, or, if the criteria are broken, then there is disparate impact/treatment

p.1: explain the distinction between the legal and technical concepts a bit more. The matching terms are disparate impact / impact disparity and disparate treatment / treatment disparity, not parity vs. disparity; using "disparate" in one case and "parity" in the other creates some confusion

p.3: expanding the expected accuracy of the binary decision rule d(X): why is it + E(Y) rather than - E(Y)? (It does not make a difference for the conclusion.)

p.6: "that quantifies reduction in p-gap ... to the reduction in the CV-gap divided ... the desired CV-score is reached": there seems to be some inconsistency between p-gap and CV-gap; I guess the p-gap should be used, since the value p appears in the equations

p.6: recall what \hat{p_i} represents

----------- Suggestions -----------

Explain the difference between the legal and technical concepts and the corresponding terms more carefully. Move the example about hiring and hair length earlier.

----------------------- REVIEW 2 ---------------------

PAPER: 1146
TITLE: Does Mitigating ML's impact disparity require treatment disparity?
AUTHORS: Zachary Lipton, Alexandra Chouldechova and Julian McAuley

Novelty: 3 (Incremental)
Quality: 3 (Fair)
Presentation: 1 (No)
Overall evaluation: -2 (I am not championing but if there is a champion then I am fine accepting)

----------- Strengths -----------

1. Addresses an interesting problem.
2. Provides some analysis.
3. Presents experiments on a few real datasets.

----------- Weaknesses -----------

1. Clarity could be improved; no typos, but quite dense and unclear writing.
2. Use a running example to showcase concepts.
3. Not many experiments with real datasets.

----------- Review -----------

The authors claim that the contribution of the paper is to demonstrate the disconnect between satisfying treatment and impact parity for a class of algorithms (DLPs). They show that DLPs don't provide a good trade-off between impact parity and accuracy, particularly when there are attributes that are correlated with the protected attributes. The paper presents some contributions, but they are not that groundbreaking or revealing about the problem. I also found the paper quite difficult to read, as the language used is dense and cryptic at times. I think the authors should try to write the paper in simpler language and use a concrete running example to guide the reader through the contributions of this work and to better motivate the various parity cases. I would also recommend finding better examples for the experiments, as I don't believe that either the synthetic or the small real datasets are convincing about the impact of this work.

Other:

"Disparate treatment addresses" -> does it address or does it characterize? Same for "Disparate impact addresses…"

"a classifier should be blind to the protected characteristic" -> what does this mean? That it should not consider that feature?
----------- Suggestions -----------

1. Make the writing clearer. Use running examples. Ask a graduate student to read the paper and comment on how understandable it is.
2. More experiments with real datasets.

----------------------- REVIEW 3 ---------------------

PAPER: 1146
TITLE: Does Mitigating ML's impact disparity require treatment disparity?
AUTHORS: Zachary Lipton, Alexandra Chouldechova and Julian McAuley

Novelty: 3 (Incremental)
Quality: 2 (Weak)
Presentation: 1 (No)
Overall evaluation: -4 (I believe this should be rejected)

----------- Strengths -----------

Fairness in algorithmic decision making is a very important, timely topic with potentially high social value.

----------- Weaknesses -----------

The paper makes many misleading claims. The experiments and analysis are partly flawed. The language used by the authors is pretentious.

----------- Review -----------

The authors devote their work to arguing that a set of recent methods for fair binary classification from the literature [23, 25, 39] break treatment parity when enforcing impact parity. In more detail, [23, 25, 39] argue that, since they only use sensitive features during training, not during test, these classifiers exhibit treatment parity at test time because two users with the same features but different sensitive features will receive the same outcome. In this work, the authors disagree, arguing that these classifiers indirectly use sensitive features at test time by employing proxy variables. While the main criticism is something worth discussing, it seems that the disagreement is due to different definitions of disparate treatment. This is something that the authors of the current work quietly ignore; they just take the definition of disparate treatment that serves their argument. More importantly, the current version of the paper contains many misleading claims, the experiments and analysis are partly flawed, and the language used by the authors is pretentious. In more detail:

- In their introduction, the authors define disparate treatment in the 3rd paragraph without providing any reference to the literature. Since their work argues that a set of previous work has misunderstood what disparate treatment is, I would encourage the authors to back up their definition/assumptions with references. This is in contrast with disparate impact, where they provide several references to back up their definition. Moreover, while there is a formal, mathematical definition of disparate impact, the authors never really write a formal definition of disparate treatment. This seems absolutely necessary given the focus of the paper.

- The authors make their case based on a single piece of recent work from the law [18], rather than a consensus argument drawn from several pieces of work. More importantly, their interpretation seems rather questionable. In particular: (1) [18] tries to debunk the argument that disparate impact may be justified if there is no disparate treatment. In fact, [18] writes that "knowingly continuing to use a biased model" constitutes disparate treatment. However, [23, 25, 39] provide mechanisms to ensure there is _not_ disparate impact (there is no bias) and, as an added benefit, they do not use the sensitive features to make a decision at test time. (2) At the end of the introduction, the authors copy a paragraph from [18] to make the case that treatment disparity also exists if done indirectly.
However, the paragraph taken from [18] continues by saying (and I quote literally): "The ZPD’s model is more complex, but if that model selects applicants based purely on their species, the ZPD is still effectively engaged in disparate treatment". This weakens their argument significantly, since I believe the methods in [23, 25, 39] do not select users "purely" on their sensitive feature.

- In the introduction, second page, first column, item (2), the authors ignore that [39] provides a mechanism to control the loss of particular users within a class, avoiding harm to them. In particular, refer to section 3.3 in [39].

- The authors decide to name the methods their paper focuses on and criticizes "disparate learning processes" (DLPs). This appears to me highly pretentious -- the authors may of course make the case in this paper that those methods incur disparate treatment; however, that is their point of view, and it seems inadequate to name the methods according to their criticism.

- In the last paragraph of section 2, the authors argue that "under some circumstances, DLPs can (indirectly) realize any function achievable through treatment parity". Surprisingly, "under some circumstances" refers to the situation where the sensitive features can be deterministically obtained from another feature or features, which seems a _very restricted_ scenario, highly implausible in practice.

- At the beginning of section 3, the authors include (3) to strengthen their case; however, (3) is referred to in the next section and is not theoretically backed up. Moreover, the authors still insist on including a section titled "Within-class discrimination when protected characteristic is partially redundantly encoded" at the end of section 3, which basically refers to section 4 again. This may be misleading, and I would suggest that such empirical findings should only appear in section 4.

- In section 3, under "Treatment disparity is optimal", the claim is only true under perfect knowledge of the distribution, which has been shown to be a result of limited value by Woodworth et al. (Learning Non-Discriminatory Predictors, https://arxiv.org/pdf/1702.06081.pdf). Moreover, section 3 just contains small variations and incremental results based on [11] rather than novel results, as the authors seem to imply in the introduction.

- The authors claim in the second column of page 3 (section 3) that [32] is concurrent independent work; however, [32] was uploaded to arXiv on May 25, 2017, while a first version of the authors' work was uploaded to arXiv on Nov 19, 2017. It may be the case that the authors were unaware of [32] and independently developed their work; however, I would suggest avoiding calling it "concurrent" when there is a 6-month gap.

- The theoretical analysis seems partly flawed for the following reasons. The authors argue that DLP methods use p(y | x) to make a decision and are thus suboptimal in comparison with the Bayes-optimal decision rule, which uses p(y | x, z), because they do not use z. However, DLPs do not use p(y | x) but instead build a function \hat{p}(y | x) (which in fact is not necessarily a probability) using knowledge of z during training. As such, it is unclear whether it is possible to make a rigorous theoretical statement.

- In practice, one needs to approximate p(y | x, z) using data; however, if there is a minority for some value of z, such an approximation can be arbitrarily bad, and thus the "optimal" decision rule, when using sample estimates, performs arbitrarily badly.
This is one of the arguments made by Woodworth et al., referred to above.

- The synthetic example provided in 4.1 is all about individual fairness and monotonicity rather than illustrating disparate treatment. The methods in [23, 25, 39] claim neither individual fairness nor monotonicity, and thus the whole argument seems misleading. Moreover, the authors build an example in which a feature that is not relevant for the task is added (hair), which would be unlikely to be used in practice. As a consequence, the example seems highly "synthetic", built to artificially make their case.

- In their synthetic example, the authors argue about non-monotonicity; however, they quietly ignore that when using two group-based thresholds, men and women would be judged under different standards (in terms of work experience). They should also discuss this.

- In 4.2, the corruption the authors add to the data does not take into account how close examples are to the boundary. This seems essential, since clear-cut cases are less likely to be flipped than unclear cases.

- The rule they use in 4.2.1, defined by items (1)-(2), is flawed. In particular, it requires knowing the number of examples in group a and group b in order to set c_i; however, at test time, one cannot really know how many examples one will need to decide upon. Without that knowledge, the performance of their rule will certainly be more underwhelming.

- The authors write in section 4.2.1, after the itemized list, that "These scores do not change after each iteration. So the greedy policy is optimal.". They should elaborate further; I do not see why the former implies the latter.

- The conclusion in the last paragraph of section 4.2.1 is questionable given that their rule uses knowledge of the number of examples in group a and group b at test time.

As a consequence, I do not think the paper is ready for publication at KDD.

----------- Suggestions -----------

Take into account my detailed comments to fix many of the flaws and misleading claims.

------------------------- METAREVIEW ------------------------

PAPER: 1146
TITLE: Does Mitigating ML's impact disparity require treatment disparity?
RECOMMENDATION: reject

The reviewers appreciated the authors' effort on the critical problem of fairness in data mining. The work is timely and holds a lot of promise. However, many concerns were raised that prevent the paper from being accepted in its current form. We encourage the authors to consider the detailed reviews, address them, and make a submission to another top venue.

==================================================
FAT (Reject)
==================================================

Dear authors,

The FAT* Conference program committee is sorry to inform you that your paper #14 was rejected, and will not appear in the conference.

Title: Can ML methods reduce disparate impact without disparate treatment?
Authors: Zachary Lipton (UCSD), Alexandra Chouldechova (Carnegie Mellon University), Charles Elkan (UCSD), Julian McAuley (UCSD)
Paper site: https://fatconf18.ccs.neu.edu/paper/14?cap=014aVmloOhd-6WE

We received 116 paper registrations and 72 full paper submissions and were able to accept 17 of those (24%) for presentation at the conference. We appreciate your willingness to share your work with us via submission, and hope that the reviews you receive below will provide you with useful feedback as you revise this submission for presentation elsewhere. We hope that you will still consider attending the FAT* Conference, February 23rd and 24th in NYC.
Registration information will be posted to fatconference.org soon.

Sorelle and Christo
FAT* PC Co-chairs

--- REVIEWS ---

Review #14A
===========================================================================

Overall merit
-------------
2. Weak reject (This paper should be rejected but I will not fight strongly)

Reviewer expertise
------------------
4. I know a lot about this area

Paper summary
-------------
This paper argues against recent methods that aim to fit models that reduce disparate impact without disparate treatment. These methods, which are referred to as "Disparate Learning Processes" (DLPs) in the text, fit a model by solving an optimization problem that satisfies fairness constraints between subgroups (e.g. equal FPR). The authors argue that DLPs are not needed because: (1) for any setting where a DLP model could be used, there exists a "Disparate Treatment" (DT) model that attains the same accuracy; (2) the DLP model may have unintended effects (e.g. it may induce within-class discrimination). In such cases, a DT model would be better, because it mitigates these effects and "promote[s] equality more transparently through direct disparate treatment, rather than through hidden changes to the learning algorithm." Using two case studies, the authors show that they can construct such a DT model that achieves the same accuracy/fairness performance as a DLP model from [ZVGRG 17].

Strengths
---------
The main strength of this paper is its message. Specifically, it highlights important shortcomings of DLP methods, and offers a high-level criticism of the way that most ML methods deal with fairness (i.e. a key part of the argument is that we should be designing methods that promote equality without hidden changes to the learning algorithm). I was unaware of some of the shortcomings that are discussed in this work. Having read this paper, I am now convinced that there may be cases where the DT approach of the authors would be preferable, and that methods to promote equality without hidden changes are generally preferable.

Weaknesses
----------
The paper argues that DLPs should not be used because, in theory, a DT model could attain the best accuracy/fairness trade-off in a way that is more transparent. Putting aside assumptions about how the data is collected, this entirely ignores the ability of current DLP/DT methods to learn models from data. Put simply, I would be willing to scrap DLP approaches altogether, but there is no method that can consistently output models that attain the theoretical guarantees in Section 4 on real datasets. This issue is only tackled in a sentence at the end of Section 6.1 ("It is not always clear how best to realize these gains in practice, where imbalanced or unrepresentative data sets can pose a significant obstacle to accurate estimation."). The case study in 4.2 provides some empirical evidence that a DT approach may be preferable, but it's a huge stretch to say that DLP approaches should be scrapped altogether based on one comparison on a real-world dataset, especially given that the comparison is made with the DLP method of [ZVGRG 17] (which, as the authors themselves point out, does not optimize the 0-1 loss, or directly account for disparate treatment).
Comments to authors
-------------------
- I don't see how we can consistently advocate the use of a DT approach over a DLP approach without one or more of the following: (i) some theoretical results that take into account estimation error; (ii) a DT approach that realizes the gains in practice; (iii) a set of experiments that compare the current DT approach against several DLP approaches on several datasets.

- In general, I thought that the message was distorted (not sure if this is intentional). I would advise qualifying some of these statements. For example, the heading "Disparate Treatment is Optimal" should really be "Disparate Treatment May Be Optimal", because the result shows that you could model the optimal classifier using different thresholds, but there is no method to fit this classifier from data.

- The core message is stated nicely in the abstract. However, it gets lost in the paper (e.g. do we really need 3 pages of introductory material + 1.5 pages of discussion?).

- It would be helpful to give some intuition about "immediate utility" to prevent readers from having to dig out the definition from [CDPF+17]. It would also help to provide an alternative definition in terms of class-based accuracy metrics (I got u(d,c) = (1-c) x [p(TPR - FNR) + (1-p)(TNR - FPR)], where p = P(y=1)).

- I think that the following point from Section 4 strengthens your argument considerably: "Because the CV score and p-% rule are non-convex in model parameters (scores only change when a point crosses the decision boundary), [KAS11, ZVGRG17] introduce convex surrogates aimed at reducing the correlation between the sensitive feature and the prediction." Based on this, even if the approach you are advocating wouldn't be able to fit a perfect classifier, it may be possible for these methods to do worse due to the use of convex surrogates - correct?

Minor changes:
- 6.3 "Separating Estimation and Decision-Making" should be "Risk Assessment vs. Decision-Making."
- 1.3 "A Note on Organization" should be "Organization"
- Corollary 2: point (3) should be left out of the statement.
- Theorem 4: typo in line 1: "under a the"
- Consistency: decision-making vs. decision making (use the former)
- "It is worth noting that an algorithm that disproportionately fortunes the dominant class is more likely to raise red flags (under disparate impact) than one that effects representative outcomes." <- 'affects'

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Review #14B
===========================================================================

Overall merit
-------------
2. Weak reject (This paper should be rejected but I will not fight strongly)

Reviewer expertise
------------------
3. I know the material, but am not an expert

Paper summary
-------------
The authors make the case that the best way to avoid disparate impact in algorithmic decision making is simply to apply appropriate affirmative action (aka reverse discrimination) to the disadvantaged group. The main arguments are: this achieves the optimum loss subject to fairness constraints; the intervention may be more interpretable than typical learning schemes subject to fairness constraints (termed DLPs); and DLPs may induce intra-group discrimination by applying the 'affirmative action benefit' unevenly.

Strengths
---------
The discussion in 6.1 and 6.2 provides an interesting perspective on the legal position, and on the potentially distortive effects of typical fairness constraints.
Weaknesses
----------
This topic has been previously discussed, e.g. Dwork et al. 2012, "Fairness through awareness", Sec. 4, which is not referenced here. As mentioned on p.2, "...there is also popular resentment of affirmative action and its future legality remains contested [BS16]" - additional comment on this would be welcome. The main technical result is not a great surprise. More discussion of implementation would be helpful. Which dataset is used in 4.2?

Comments to authors
-------------------
Minor typos on p.4:
- "into to groups" -> "into two groups"
- "the the positive" -> "the positive"

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Review #14C
===========================================================================

Overall merit
-------------
4. Accept (Good paper, it belongs in the conference)

Reviewer expertise
------------------
4. I know a lot about this area

Paper summary
-------------
This paper sets out to address a fundamental question at the heart of much work on fairness in machine learning: is it possible to avoid engaging in disparate treatment when trying to address disparate impact? The paper specifically addresses what the author(s) call(s) 'disparate learning processes' (DLPs), in which the learning *process* considers protected characteristics, but the resulting *model* itself does not — a process by which researchers have tried to avoid disparate treatment in attacking disparate impact.

The paper first establishes that legal precedent does not completely forbid the use of disparate treatment in the service of minimizing disparate impact. At the same time, it explains that while the resulting model may not consider protected characteristics, the process by which the model was developed _does_, and that courts might see this as a form of disparate treatment anyway. And where protected characteristics are fully encoded in other features, the learning process will simply find perfect stand-ins to accomplish disparate treatment anyway. It then proves and provides an example to show that disparate treatment will always generate the optimal trade-off between reduced disparate impact and accuracy — something that DLPs cannot achieve when protected characteristics are only partially encoded in other features. Under these circumstances, DLPs can also generate models that result in irrational ordering and disparate treatment that benefits the advantaged group. Finally, the authors argue that there are practical reasons to favor explicit and direct disparate treatment over DLPs, as this grants decision- and policy-makers a better way to understand the trade-offs involved and therefore engage in more precise normative debate.

Strengths
---------
This is an ambitious and well-executed paper that sets out to tackle a latent problem at the very core of the field of fairness in machine learning. It demonstrates command of both the technical and legal scholarship and raises a number of urgent questions for the field to address, including whether it even makes sense — both normatively and technically — to try to avoid disparate treatment when attempting to minimize disparate impact. The problem is well formulated and nuanced. And the paper does a great job uncovering incoherence in current approaches — as well as their practical deficiencies.
Weaknesses
----------
That said, the paper may unfairly stack the deck in its favor by focusing on situations where the data generating mechanism is well understood (i.e., where we do not have reason to believe that the training data encodes bias that we are unable to measure). The paper makes this explicit at times, but it strikes me as yet another fundamental question for the field: to what extent are these interventions designed to compensate for un-quantifiable but suspected bias encoded in the training data, and to what extent are they designed to affirmatively promote equality and diversity? The answer to this question will often determine whether an intervention is understood as disparate treatment or not. In practice, interventions likely engage in both of these things — and this is something that I would have wanted the authors to consider.

I would also welcome some further discussion of the intuition or impulse behind the aversion to disparate treatment as a mechanism to address disparate impact. I would venture to say that what some might see as problematic about DLPs is that they are *too* systematic in the way they seek out the alternative model that reduces disparate impact. The standard approach to disparate impact cases in the law is to ask whether some alternative practice could serve the same goal equally well, while reducing the disparity in outcome. One could imagine that such alternatives were identified historically by simply adjusting the selection mechanism in ways that were expected to continue to meet the intended goal and then observing whether, in test cases, they created a different distribution of outcomes for members of protected groups. It's worth considering how and why this is different — both normatively and functionally — from DLPs. My sense is that finding these alternative practices was not seen as a type of disparate treatment because the resulting minimization of disparate impact felt more incidental or indirect, but the analogue to non-machine-learning decision-making seems worth exploring much further.

Finally, I think the authors are a bit too optimistic about the prospect that setting different thresholds for different protected classes — while both a better mechanism for addressing disparate impact and somehow more honest — would pass legal muster. In fact, I suspect that what the author(s) see(s) as the practical value of forthrightness in the use of different thresholds is precisely what would invite a disparate treatment claim. I doubt the EEOC, for example, would countenance hiring rules that explicitly differed for men and women. I think the paper explains why this is an incoherent or indefensible position, but I think that likely reaction is worth considering and explaining.