------------------------- METAREVIEW ------------------------
Summary of strengths and weaknesses of the paper:
+ interesting and important topic, fresh idea
- some issues with the dataset and evaluation

Discussion: The reviewers and SPC discussed this paper at length; the key point was whether the freshness of the idea compensated for the (very) questionable issues in the data. All appreciated that the authors offered to share the dataset with the community upon publication. This paper was discussed with a second SPC and the PC chairs.

Main reasons for final recommendation of acceptance or rejection: This work brings a fresh idea, but we strongly recommend addressing the reviewers' feedback, as the work needs to provide more input and explanation, including regarding the dataset.

----------------------- REVIEW 1 ---------------------
SUBMISSION: 655
TITLE: Addressing Marketing Bias in Product Recommendations
AUTHORS: Mengting Wan, Jianmo Ni, Rishabh Misra and Julian McAuley

----------- Paper Clarity -----------
SCORE: 4 (Excellent (Easy to follow))
----------- Interest to Audience -----------
SCORE: 4 (High)
----------- Paper Significance -----------
SCORE: 3 (Above Average)
----------- Strengths -----------
1) The paper addresses a highly relevant and important topic: bias in recommenders, particularly marketing bias, which relates to user identity characteristics such as body shape and gender.
2) The paper is well written and organized, and lays out very clearly the hypotheses of the study: whether marketing bias exists in real-world datasets, how common recommender algorithms deal with this bias, and whether a proposed method for alleviating such bias can be effective. Each of these areas is systematically explored, including citations of previous work, and conclusions are drawn for each hypothesis.
3) Detailed statistical analysis is performed on the datasets and evaluation results to bolster the significance of the findings.
4) The proposed algorithm is shown to significantly decrease the bias in recommendations, while retaining rating-prediction and ranking performance fairly close to standard methods. This finding can potentially encourage more studies into this important area of recommender systems research.
5) Implementation details and hyperparameters are documented in detail, and the authors use at least one publicly available dataset (ModCloth) and plan to release the datasets from this work, which will aid reproducibility and allow others to continue research in this area.
6) The authors consider two-sided marketplaces, and their market fairness definitions make it possible to assess fairness not just to users but to product marketers as well.
7) The future work section is very clear about the limitations and assumptions made in this study (e.g. binary user identity, explicit feedback), and provides suggestions for investigating further.
----------- Weaknesses -----------
1) The authors made some assumptions to derive user identities – for example, using the average size purchased by a user as a proxy for their body shape identity, and using consistent purchase history from men’s and women’s clothing categories on Amazon as a proxy for their gender identity. This is understandable given the limitations of the datasets (and the infeasibility of collecting true self-reported user identities for such a large dataset), but it does raise some uncertainty around whether this work can generalize in the real world.
For example, a single user account could be used to purchase items for multiple people (e.g. children or a spouse) in both clothing and electronics categories, perhaps with different distributions. To their credit, they list this as a limitation in the Future Work section (see also the illustrative sketch at the end of this review).
2) Similarly, some assumptions had to be made to determine the Product Image Group for the electronics dataset. Using the Face++ API could in itself introduce biases (which they have attempted to minimize through validation). Categorizing the product image group from the model gender across multiple pictures raises the question of which image was featured as the primary/first image with the product, since many people may not scroll through multiple pictures. More importantly, limiting the dataset to products with detected images of humans is restrictive and could in itself be biased - it would be interesting to study other ways electronic products could be marketed in a biased way, for example with colors, image style, and text titles and descriptions (they do address descriptions briefly, but this could be expanded in future research).
3) I am curious why they did not also include the RentTheRunway dataset, which is available from the same authors alongside the ModCloth dataset, and which is much larger.
4) Figure 4b – If the market segments are sorted based on their sizes in the training data, why are the black bars (distribution of market segments in the test data) not monotonically increasing? Why is there a difference in distribution between the training and test data?
5) The References section uses non-standard and inconsistent formatting (e.g. “In CIKM”, “In RecSys”). Please follow the ACM style detailed here: https://www.acm.org/publications/authors/reference-formatting
6) Minor typos and edits: In Section 2, “catelog” should be “catalog”; in Section 3.2, “Famale” should be “Female”. In Section 5, “Rating Prediction Fairness”, it should read “…the first market fairness description _is_ indeed consistent with the null hypothesis…”
7) Since the authors have considered multi-sided markets in their analysis, it would be worth reviewing "Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems" by Mehrotra et al.
----------- Overall Evaluation -----------
SCORE: 2 (Accept)
----- TEXT: The subject of fairness in recommender systems is highly relevant, and the authors have done a thorough job of analyzing existing bias in datasets, measuring the effect of standard recommender system algorithms on that bias (including in a two-sided marketplace framework), and proposing a new approach to mitigating that bias. The positive results of the proposed method could give direction to future research in this area. The potential weakness of this paper is the many assumptions that are made in assigning product categories and user identities in the different datasets, as well as the interactions between them. This complex aspect of bias could be better investigated through a user-centric study. However, given the data-driven approach of this paper, and the difficulty of obtaining the true categories and identities for this type of data, these assumptions are understandable and are mostly clearly called out as limitations. Given the limitations of the datasets used, this study does provide valuable insights into how bias manifests in recommender systems, and gives some suggestions on how to minimize this bias.
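P.S. To make the proxy-identity concern in Weakness 1 concrete, the following is a minimal sketch of the kind of derivation I understand was performed for the body-shape attribute (my own illustration, not code from the paper; the file name, column names, and median-split rule are all assumptions):

    import pandas as pd

    # hypothetical input: one row per transaction, with the size that was purchased
    transactions = pd.read_csv("modcloth_transactions.csv")  # assumed columns: user_id, purchased_size

    # average size purchased by each user, used as a proxy for body shape
    avg_size = transactions.groupby("user_id")["purchased_size"].mean()

    # split at the median to obtain two coarse identity groups, mirroring the
    # binary-identity assumption the paper makes
    threshold = avg_size.median()
    user_identity = avg_size.gt(threshold).map({True: "Large", False: "Small&Medium"})

    # note: a single account buying for several people (spouse, children) collapses
    # multiple true identities into one averaged label under a rule like this
    print(user_identity.value_counts())

Under any rule of this form, the averaged proxy is only as reliable as the assumption that one account corresponds to one shopper.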
----------------------- REVIEW 2 ---------------------
SUBMISSION: 655
TITLE: Addressing Marketing Bias in Product Recommendations
AUTHORS: Mengting Wan, Jianmo Ni, Rishabh Misra and Julian McAuley

----------- Paper Clarity -----------
SCORE: 3 (Above Average)
----------- Interest to Audience -----------
SCORE: 3 (Medium)
----------- Paper Significance -----------
SCORE: 3 (Above Average)
----------- Strengths -----------
A – The paper proposes a new potential bias in recommendation systems.
B – It gives good evidence that the marketing bias actually exists, and provides a fairness-aware framework to debias it.
C – The paper is well written and easy to follow.
----------- Weaknesses -----------
A – There are some flaws in the experiments and evaluation metrics.
B – More details about the ANOVA model should be added to make the paper self-contained.
C – It does not provide code for reproducing the results.
----------- Overall Evaluation -----------
SCORE: 1 (Weak accept)
----- TEXT: In this paper, the authors study an issue of recommendation fairness called marketing bias, which is introduced by different marketing strategies, e.g. product images. The authors investigate two real-world datasets and show that the marketing bias actually exists by performing statistical analysis on different market segments. The paper further investigates how standard collaborative filtering algorithms react to this biased input data. It addresses the problem by proposing a regularization term to describe the fairness of recommendations, and achieves good results. The following are my concerns:
1. For the Electronics dataset, the authors remove all products attached to “women” or “men” categories. How can we be sure that all the remaining products are gender neutral? It is possible that a product is not gender neutral even though it is not attached to a gender category. The preprocessing of the Electronics dataset needs more clarification.
2. The evaluation metric for product ranking fairness is the KL divergence between the predicted (recommended) and actual frequency distributions over the different segments. But the actual distribution is already biased; how can it serve as the ground truth for fair recommendation?
3. In Section 7.1, why are the experiments (Figures 3 and 4) run only on traditional CF models? What is the performance of the proposed fairness-aware framework?
To summarize, the paper points out an interesting and important potential bias in the recommendation task. It shows that the bias exists and proposes a straightforward method to debias. Although it has some flaws in the experiments, it is a rather good paper.
Minor points:
- Footnotes 2 and 4 are the same.

----------------------- REVIEW 3 ---------------------
SUBMISSION: 655
TITLE: Addressing Marketing Bias in Product Recommendations
AUTHORS: Mengting Wan, Jianmo Ni, Rishabh Misra and Julian McAuley

----------- Paper Clarity -----------
SCORE: 3 (Above Average)
----------- Interest to Audience -----------
SCORE: 4 (High)
----------- Paper Significance -----------
SCORE: 3 (Above Average)
----------- Strengths -----------
- The paper addresses, for the first time in the literature, the problem of guaranteeing both consumer- and provider-fairness in recommender systems;
- The problem is tackled by first showing that product recommendations are affected by a marketing bias;
- The authors collected datasets that they plan to make available to the research community, which is facing a scarcity of datasets containing sensitive attributes of users.
----------- Weaknesses -----------
- The presentation of the paper is not clear (see later comments for further details);
- The collection of the Amazon dataset mixes products coming from two different departments, which might lead to biased conclusions;
- The authors state they deal with consumer- and provider-fairness at the same time, so it would be good to see the benefits from both perspectives in one plot.
----------- Overall Evaluation -----------
SCORE: -1 (Weak reject)
----- TEXT: The authors present an approach to address marketing bias in recommender systems, i.e., how items are portrayed on online platforms and how this affects recommendations. The authors collect two datasets, from ModCloth and Amazon, and show that the marketing bias actually exists; they then analyze its impact on recommender systems, which amplify it. To overcome this bias, they propose a regularization factor added to the matrix factorization loss function, which ensures that different market segments receive similar recommendation errors (C-fairness) and that the frequency distribution of the market segments in the recommendations reflects that of the observations (P-fairness). The paper deals with very important and relevant issues, and I appreciate the effort of trying to deal with both consumer- and provider-fairness at the same time, in order to make a recommender system guarantee both. The collection of novel datasets that would be made available to the community is also a very positive aspect, since most recommender systems datasets do not contain any sensitive features of the users. Unfortunately, the paper, in its current state, has several limitations. First of all, I do not understand why the Amazon dataset characterizes the product image group in one category (Electronics) and the user identity group in another (Clothing). The two categories are totally different, they reasonably have different targets, and indeed only 11% of the datasets overlap. I believe that trying to address marketing bias in this way leads to a flawed analysis, since we are mixing different items and different users in order to characterize a phenomenon. The presentation also has limitations, since it is not clear to me how the C- and P-fairness formulations derived in Equations 2 and 3 have led to the regularization in Equation 7. I was expecting them to be there; maybe they are, but if they are, it is not clear how. Because the authors dealt with consumer- and provider-fairness at the same time, it would have been good to see how the regularization provided benefits to both stakeholders (defining a fairness index that captures both properties is a common practice in the literature).

----------------------- REVIEW 4 ---------------------
SUBMISSION: 655
TITLE: Addressing Marketing Bias in Product Recommendations
AUTHORS: Mengting Wan, Jianmo Ni, Rishabh Misra and Julian McAuley

----------- Paper Clarity -----------
SCORE: 3 (Above Average)
----------- Interest to Audience -----------
SCORE: 4 (High)
----------- Paper Significance -----------
SCORE: 3 (Above Average)
----------- Strengths -----------
+ new problem
+ can open lots of future work / a new direction
----------- Weaknesses -----------
- execution needs quite some work
- some more explanation needed
----------- Overall Evaluation -----------
SCORE: 1 (Weak accept)
----- TEXT: (feedback after discussion with the 2nd senior PC) The paper suggests and discusses a novel issue, marketing bias in recommender systems.
I think there is value in highlighting new areas to explore, especially those where large-scale data can help test and form hypotheses, and in that sense the paper is valuable to WSDM. Regarding the execution, several flaws can be identified. First, the process of data collection from Amazon is composed of many independent steps (one might even say "hacks"), and the propagated biases and errors could amount to large ones. In particular:
- For collecting user gender: it is done via "users’ interactions with Amazon’s Clothing products". How was this done, exactly? It seems some scraping activity was performed, but this requires much more detail in the write-up, not only for reproducibility but also for scrutiny of any bias which might be introduced by the scraping and processing. And given that this is third-party scraping, I imagine the interactions are not page-view events but more likely reviews written by the users. This in itself might introduce a gender bias, if the propensity to write a review correlates with gender.
- For collecting electronics: "We collect all pictures associated with the electronic products". This process needs to be described in more detail, first of all for reproducibility. Beyond that, the data collected is relatively small: Table 1 specifies there are under 10,000 items, whereas Amazon has many more. And any index page which a crawler would start from is likely to also be the result of a recommendation algorithm and business rules. For example, depending on the client country, different items might not be available. The point I am getting at is that this data might be biased.
Going forward, the formulation in Equation 7 is not clear enough with regard to how it regularizes for C- and P-fairness. I think the paper needs to discuss this in more detail, and also tie it into the algorithms tested in Section 7 (one possible reading is sketched at the end of this review).
Talking just about the descriptive part, I think the authors did a generally good job describing the problem. At the basic level, one could argue that marketing is a highly optimized process connecting buyers and sellers, and that any biases would have been factored in, or else the marketers would have lost their jobs. For example, in Section 4.1, we would trust the marketers to choose whatever images maximize sales, even if those are not gender-neutral, as the authors might expect electronics products to be. But Section 4 in general answers that with a statistical analysis showing the facts are more complicated than that. (Notwithstanding, Figures 3 and 4 are too small and contain too many details to support any conclusions.)
Good: data preparation of the "electronics" dataset was done with care. Especially impressive is the validation using human labelers.
Overall, the paper defines an interesting problem and makes good progress towards a solution, particularly in describing it. In the part where it suggests solutions and evaluates them, I feel it could use more clarity and sharpening of the message.
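For concreteness, the link I was hoping to see spelled out between Equations 2-3 and Equation 7 could be as simple as the sketch below, which reads C-fairness as parity of mean errors across market segments. This is my own illustration of one possible reading, with assumed variable names; it is not necessarily what the paper implements:

    import numpy as np

    def fairness_regularized_loss(pred, rating, segment_id, n_segments, lam=1.0):
        """Squared-error MF loss plus a penalty on the spread of mean errors
        across market segments (hypothetical reading of a C-fairness term)."""
        err = pred - rating
        base_loss = np.mean(err ** 2)                      # standard reconstruction loss
        # mean error per market segment (assumes every segment appears in the batch)
        seg_means = np.array([err[segment_id == s].mean() for s in range(n_segments)])
        penalty = np.var(seg_means)                        # unfairness: unequal treatment of segments
        return base_loss + lam * penalty                   # lam trades accuracy against parity

A P-fairness analogue would instead penalize the divergence between the recommended and observed segment frequencies. Writing out both terms explicitly, and connecting them to the algorithms evaluated in Section 7, would make the proposed framework much easier to assess.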