Paper Decision
DecisionProgram Chairs21 Jan 2024, 23:40 (modified: 22 Jan 2024, 22:40)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers, AuthorsRevisions
Decision: Accept
Comment:
The paper proposes a novel approach to off-policy estimation by reducing variance through policy convolution. Reviewers find the idea highly novel and interesting. The author response is well-argued and thorough.
One reviewer is concerned about relevance to the UMAP track. While I do not share this particular concern, I do agree with the reviewer's sentiment that the connection to recommender systems can easily be missed if the paper is not read carefully. To this point, I find the discussion of the MovieLens experimental setup in lines 489-508 rather opaque and technical: it does not carefully discuss the motivation for this setup or justify why this particular experiment is the one the authors should be running with actual recommendation data. I note that three reviewers question the rationale for using MovieLens (only) and how the setup relates to recommendation. The authors provide useful justification in their rebuttal that should be incorporated into the experimental design description, along with more explanatory detail (not just what was done, but why it was done).
In addition, a large number of other points of clarification and minor correction came up in the reviews and I think it is critical for the authors to incorporate their responses and suggested changes on revision. All of these revisions seem straightforward and only relate to presentation.
With revisions as suggested above (and in the reviews), I believe this paper can make a solid and novel contribution to the WWW UMAP track.
Overall response to all reviewers
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:10Program Chairs, Senior Area Chairs, Area Chairs, Reviewers, Reviewers Submitted, Authors
Comment:
We thank the reviewers for their thoughtful reviews and valuable feedback. We will revise the paper accordingly. We respond to the questions and comments raised by individual reviewers, one by one, below. Please note that the WWW general chairs have disallowed authors from uploading a revised PDF during the rebuttal phase; we therefore promise to account for the suggestions in our next revision.
Official Review of Submission1159 by Reviewer MJi1
Official ReviewReviewer MJi130 Nov 2023, 20:13 (modified: 01 Dec 2023, 22:27)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer MJi1, AuthorsRevisions
Review:
Summary
This paper presents a new approach to the problem of off-policy evaluation (OPE) in the contextual bandit setting. Traditional methods for OPE, such as Importance Sampling, struggle with high variance when dealing with large action spaces or significant policy mismatches. The authors propose the Policy Convolution (PC) estimator to address these challenges. PC uses action embeddings to leverage latent structures within actions, hypothesizing that similar embeddings correlate with similar expected rewards. The goal of PC is to create a more favorable bias-variance trade-off by convolving the logging and target policies. The paper tests the PC estimator on both synthetic and real-world datasets, focusing on its performance in reducing mean squared error under conditions of large action spaces or high policy mismatch.
Strengths
The concept of the Policy Convolution (PC) estimator is novel and addresses a well-defined gap in off-policy estimation tasks. The motivation behind the approach is clearly articulated, demonstrating a deep understanding of the challenges in this domain.
The paper presents extensive ablative and controlled evaluations, showcasing the robustness of the PC approach in various off-policy estimation scenarios.
The paper is well-structured and presents its ideas in an intuitive and straightforward manner, making it accessible and easy to comprehend. The logical flow and clarity of the presentation facilitate a quick grasp of the core concepts and methodologies.
The assumptions underlying the PC approach are discussed well, providing clarity on the conditions and contexts in which the approach is most effective. This discussion helps in understanding the scope and limitations of the proposed method.
Weaknesses
Presentation clarity: The figures in the paper are cluttered and difficult to interpret, with too many lines that obscure the underlying trends.
Abstract specificity: The abstract does not clearly state that the paper's focus is exclusively on the contextual bandit problem.
Theoretical justification: The claim that "offCEM" and "groupIPS" are special cases of PC lacks sufficient theoretical substantiation or derivation, at least in the appendix.
Baseline comparisons: The evaluations primarily compare PC variants to basic policy estimators without considering baselines that also use latent action information, which is a key aspect of PC.
A bit more detailed comments on the cons
Firstly, the figures presented are somewhat cluttered and challenging to interpret, with an excess of overlapping lines that make it difficult to discern specific trends. A more streamlined approach to data visualization could greatly enhance the clarity of these results.
Additionally, the paper's claim on page 4 that "offCEM" and "groupIPS" are special cases of the proposed PC method lacks comprehensive theoretical substantiation. Including detailed derivations or expanded explanations, possibly in the appendix, would strengthen this assertion.
Lastly, while the evaluation compares PC variants to standard policy estimators, it overlooks the comparison with other methods that also use latent action information. Such comparisons, particularly with "offCEM" and "groupIPS," would provide a more robust validation of PC's effectiveness in contexts where latent action information is utilized.
Questions:
Could you provide a formal derivation to clarify the relationships between PC and other methods, i.e., offCEM and groupIPS? This would help understand the theoretical underpinnings and the distinctions or similarities between these approaches.
In Equation (1), the term $q(x,\cdot)$ is used for the logging policy. However, this term isn't defined elsewhere in the paper. Is this a typo of $r(x,\cdot)$?
In Figure 2, several lines indicate a decrease in MSE as the action space size increases. This doesn't seem straightforward; could you explain why and how this trend occurs?
Regarding the "DR Backbone" in Figure 2, there's a noticeable peak in MSE for the kNN PC variant. This peak represents a significant increase in MSE. What could be the underlying reason for this anomaly?
On page 6, the discussion about Figure 4 mentions that "as $n$ continues to increase, PC converges to its respective backbone estimator." However, this statement isn't clearly reflected in the figure. Could you provide further explanation or clarification on how this conclusion was drawn?
The paper assumes that actions with similar rewards will be closer in the latent space. How would the efficacy of the PC be affected if this assumption does not hold? Discussing the impact of embedding quality on PC's performance would be valuable in understanding its robustness under varying conditions of the latent space.
Ethics Review Flag: No
Ethics Review Description: N/A
Scope: 4: The work is relevant to the Web and to the track, and is of broad interest to the community
Novelty: 6
Technical Quality: 5
Reviewer Confidence: 4: The reviewer is certain that the evaluation is correct and very familiar with the relevant literature
Response to reviewer MJi1 [Part 1/2]
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:36Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
We really thank you for your time and effort in reviewing the paper and providing us with your valuable feedback! Below are our responses to your questions; please let us know if we can add to or expand on anything:
Q1: The figures in the paper are cluttered and difficult to interpret.
A1: We have replotted the figures with improved shading and coloring strategies to improve their readability. We will update these figures in the revision.
Q2: The abstract does not clearly state that the paper's focus is exclusively on the contextual bandit problem.
A2: We will make this clear in the abstract in the revision.
Q3: The claim that offCEM and groupIPS are special cases of PC lacks sufficient theoretical substantiation.
A3: Thank you for bringing this up. We will make our claim more formal in our next revision's appendix, in addition to the existing explanation in lines 397-400. Since this won't take much space, we prove our claims formally below, i.e., how groupIPS and offCEM are special cases of PC (which we call PC-IPS and PC-DR, respectively) with a specific kind of convolution function:
In groupIPS, the convolution function is a single-depth tree pooling function (equivalent to one level of clustering), where $c(a)$ represents the cluster of the action $a$.
Q4: Evaluations primarily compare PC variants to basic policy estimators.
A4: While we don't explicitly plot the results for other embedding-based estimators (namely, MIPS [1], groupIPS [2], offCEM [3], and the similarity estimator [4]), as we specified in lines 394-405 (and expanded on in Q3): (1) MIPS reduces to IPS under our deterministic action-embedding mapping; and (2) groupIPS, offCEM, and the similarity estimator are specific instantiations of PC. Further, in Figure 6, we explicitly point out the advantage of PC's flexible levels of convolution over the fixed convolution levels used by groupIPS, offCEM, and the similarity estimator, which allows PC to balance the bias-variance tradeoff better.
Q5: q(x,⋅) isn’t defined.
A5: Thank you for pointing out this typo. You’re right that $q(x,\cdot)$ here should be $r(x,\cdot)$. We’ll fix this in our revision.
Q6: In Figure 2, several lines indicate a decrease in MSE as the action space size increases.
A6: Great observation! The simple explanation for this phenomenon is the underlying stochasticity of our synthetic data setup while defining the reward function (Section 4.1). Due to this stochasticity, there is no strict guarantee of an increasing MSE with an increasing number of actions. However, we do observe this phenomenon on average, across a variety of OPE estimators, due to the implicitly increasing difficulty of OPE.
Q7: Regarding the "DR Backbone" in Figure 2, there's a noticeable peak in MSE for the kNN PC variant.
A7: Great observation! We attribute this behavior to the same stochasticity in our data setup, as explained in Q6. This means there is no strict ordering of estimator performance for each individual experiment (e.g., each action-space size); the ordering is only valid on average.
Response to reviewer MJi1 [Part 2/2]
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:37Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
Q8: "As $n$ continues to increase, PC converges to its respective backbone estimator" isn’t supported in Figure 4.
A8: Great observation! There are two things at play here: (1) as described in lines 553-554, the figures only include results for PC with a positive amount of pooling; and (2) due to computational limits, we only scale $n$ to 25k in Figure 4. We explain the effect of both below:
Since PC directly encompasses its backbone estimator (when no pooling is performed), whenever the backbone estimator performs better than PC, PC and its backbone have in reality converged. This is another way of saying that performing no convolution is actually the best decision in that scenario. Looking at Figure 4(b), SNIPS performs worse than all pooling functions for PC-SNIPS at small $n$, whereas it becomes better than 3 out of 4 pooling functions at the largest $n$.
In the limit of infinite data and under reasonable assumptions, naive propensity-based estimators are minimax optimal [5]. Hence, the PC family of estimators will converge to the baseline estimators with increasing $n$. However, due to computational limits, we could only scale $n$ to 25k.
Q9: The paper assumes that actions with similar rewards will be closer in the latent space. How would the efficacy of the PC be affected if this assumption does not hold?
A9: Great question! If two actions that are close in the latent space do not have similar rewards, then the PC family of estimators will incur a large bias (at the cost of a smaller reduction in variance leading to a worse MSE), since PC effectively imputes reward for an action using other actions that are closer to it. So, it is crucial to verify this assumption before employing PC for OPE. We will add this important discussion in our next revision.
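As a toy illustration of this point (our own construction, not from the paper), pooling each action's reward over its neighbors incurs little error when neighboring actions share similar rewards, and a large error when they do not:

```python
import numpy as np

A = 10
K = np.zeros((A, A))
for i in range(A):                          # uniform pooling over {i-1, i, i+1}
    nb = [j for j in (i - 1, i, i + 1) if 0 <= j < A]
    K[i, nb] = 1.0 / len(nb)

r_smooth = np.linspace(0.0, 1.0, A)         # neighbors have similar rewards
r_rough = np.tile([0.0, 1.0], A // 2)       # neighbors maximally dissimilar

bias_smooth = np.abs(K @ r_smooth - r_smooth).mean()
bias_rough = np.abs(K @ r_rough - r_rough).mean()
# bias_smooth is near zero; bias_rough is large
```

The smooth reward vector is nearly unchanged by pooling, while the alternating one is badly distorted, which mirrors the bias PC would incur when the embedding assumption fails.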
Finally, we kindly request that you reconsider your final ratings, as they directly decide the outcome of this paper, which in your own assessment is novel, well-motivated, and supported by strong empirical evidence.
[1] Saito, Yuta, and Thorsten Joachims. "Off-Policy Evaluation for Large Action Spaces via Embeddings." International Conference on Machine Learning. PMLR, 2022.
[2] Jie Peng, Hao Zou, Jiashuo Liu, Shaoming Li, Yibao Jiang, Jian Pei, and Peng Cui. Offline policy evaluation in large action spaces via outcome-oriented action grouping. In Proceedings of the ACM Web Conference 2023.
[3] Saito, Yuta, Qingyang Ren, and Thorsten Joachims. "Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling." arXiv preprint arXiv:2305.08062 (2023).
[4] Nicolò Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, and Paolo Cremonesi. Off-policy evaluation with deficient support using side information. Advances in Neural Information Processing Systems, 2022.
[5] Wang, Yu-Xiang, Alekh Agarwal, and Miroslav Dudık. "Optimal and adaptive off-policy evaluation in contextual bandits." International Conference on Machine Learning. PMLR, 2017.
Official Review of Submission1159 by Reviewer jX3e
Official ReviewReviewer jX3e29 Nov 2023, 23:22 (modified: 01 Dec 2023, 22:27)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer jX3e, AuthorsRevisions
Review:
Off-policy estimation (OPE) is a widely adopted approach in reinforcement learning that faces the challenge of distribution shift between the logging policy (responsible for generating the data) and the target policy, particularly in scenarios with a large action space. To tackle this challenge, the present study introduces a novel Policy Convolution (PC) estimator that leverages the latent action structure defined by action embeddings to enhance off-policy evaluation.
Pros:
The draft is well-organized, providing concise and clear explanations of related works and OPE estimators.
The experimental design is elaborate, with clearly defined objectives. The titles in Sec 4.2, formulated in the form of a problem, are easily understandable.
Cons:
The presentation of the proposed method in this paper is quite brief, with Section 3 comprising only about one page. It is recommended to consider elaborating on the theoretical analysis of the method's "bias-variance trade-off" and providing a more comprehensive description of the motivating example.
The experimental line chart in this paper contains multiple lines, and the lines are visually indistinguishable, making it difficult to differentiate between the proposed method and the baselines. It is recommended to consider using dashed lines for the baselines to differentiate them from the solid lines representing the proposed method.
Questions:
The experiments in this paper were conducted on only two datasets, with one of them being a self-constructed dataset. It is desirable to conduct experiments on a wider range of datasets to demonstrate the applicability of the proposed method.
The performance of the four pooling functions varies in different scenarios (i.e., different datasets and backbones). Some choices may appear to perform worse than the baseline. In practical usage of the PC estimator, how should one choose the pooling function?
How does the proposed method perform in terms of computational efficiency? Does it incur a significant time cost compared to the baseline?
Ethics Review Flag: No
Ethics Review Description: No
Scope: 3: The work is somewhat relevant to the Web and to the track, and is of narrow interest to a sub-community
Novelty: 4
Technical Quality: 4
Reviewer Confidence: 3: The reviewer is confident but not certain that the evaluation is correct
Response to reviewer jX3e [Part 1/2]
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:39Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
We really thank you for your time and effort in reviewing the paper and providing us with your valuable feedback! Below are our responses to your questions; please let us know if we can add to or expand on anything:
Q1: Consider elaborating on the theoretical analysis of the method's bias-variance trade-off and providing a more comprehensive description of the motivating example.
A1: Thank you for raising this point. We would like to point out that PC is a vast family of estimators with (1) varying pooling functions; (2) different levels of convolution; and (3) different backbone estimators. This makes it challenging to develop rigorous theoretical analysis on the bias and variance in general, since they might well depend on different components we use in PC. Thus we leave it for future work. That said, many existing estimators, namely groupIPS [1], offCEM [2], and the similarity estimator [3] are all specific instantiations of the PC family of estimators (see Q3 in our response to reviewer MJi1 for a formal statement). And their theoretical results also apply to special cases of PC.
Regarding the motivating example, we’ll make sure to add more details in the main-text in our next revision as you suggested.
Q2: The figures contain multiple lines and aren’t clear.
A2: Thank you for raising this point. We agree that having better shading and coloring strategies will certainly improve the readability of our plots. Taking your advice, we have already re-done all our plots with a significantly improved readability, and will update them in our next revision.
Q3: Experiments were conducted on only two datasets, with one of them being a self-constructed dataset.
A3: The goal of our experiments is to test the internal and external validity of PC. We test the internal validity of PC on the self-constructed dataset (as is the case with almost all existing OPE papers), where we can explicitly control the off-policy evaluation setup. We test the external validity of PC on the MovieLens dataset, where we try to mimic different real-world OPE scenarios (e.g., with embeddings derived from data), encompassing different (1) reward functions, (2) numbers of actions, (3) amounts of bandit data, (4) policy mismatches, and (5) action embedding sizes. Please see Figures 2 to 31 to appreciate the complexity of our empirical evaluation.
To keep the scale of the empirical evaluation manageable, we choose to vary different OPE scenarios on one of the most representative recommendation datasets, MovieLens, rather than on multiple datasets in only specific settings. We would argue that prior works either test only on datasets that are less reflective of real-world scenarios (e.g., converting classification datasets to bandit instances), or test on only a small number of datasets with comparable or fewer OPE scenarios.
Q4: The performance of the four pooling functions varies in different scenarios. Some choices may appear to perform worse than the baseline.
A4: We answer the two questions separately:
As we mentioned in the paper, there is no clear winner among the pooling functions in the policy convolution framework. This is expected, as different convolution functions obey different underlying assumptions and inductive biases, leading to different bias-variance trade-offs in different OPE scenarios. Please note that this behavior is a clear indication of the need for a family of convolution estimators, rather than specific instantiations like groupIPS [1] or offCEM [2]. In practice, we rely on a validation set to choose the optimal convolution function for a particular OPE setting.
As described in lines 553-554, we would like to remind you that the figures only include results for PC with a positive amount of pooling, since PC directly encompasses its backbone estimator (when no pooling is performed). Hence, whenever the backbone estimator performs better than PC in the figures, PC and its backbone have in reality converged. This is another way of saying that performing no convolution is actually the best decision in that specific scenario.
Q5: How does the proposed method perform in terms of computational efficiency?
A5: Good question! Comparing IPS (section 2.2.2) vs. PC-IPS (section 3), there is an added convolution operation in PC. There are two things to note here:
The run-time of OPE estimators has historically never been a bottleneck, because such policy value evaluations are conducted exactly once, and only after we have a trained policy to evaluate. Further, existing estimators like groupIPS [1], offCEM [2], the similarity estimator [3], etc. all have the same time complexity as the PC family of estimators.
The convolution operation can be trivially parallelized on a GPU, as a matrix multiplication between the policy probabilities for a batch of contexts and a pre-computed action-action similarity matrix.
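For instance, under the assumption that the convolution is a linear smoothing over actions, the batched operation is a single matrix multiplication (a sketch with made-up shapes, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
B, A = 4, 6                                    # batch of contexts, action count
pi = rng.dirichlet(np.ones(A), size=B)         # pi(.|x) for each context in the batch

# Pre-computed, row-normalized action-action similarity matrix.
sim = rng.random((A, A))
K = sim / sim.sum(axis=1, keepdims=True)

pi_conv = pi @ K                               # convolve the whole batch at once
```

Because each row of `pi` and each row of `K` sums to one, each row of `pi_conv` is again a valid distribution over actions; on a GPU the same line runs as one batched matmul.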
Response to reviewer jX3e [Part 2/2]
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:40Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
Finally, we kindly request that you reconsider your final ratings, as they directly decide the outcome of this paper.
[1] Jie Peng, Hao Zou, Jiashuo Liu, Shaoming Li, Yibao Jiang, Jian Pei, and Peng Cui. Offline policy evaluation in large action spaces via outcome-oriented action grouping. In Proceedings of the ACM Web Conference 2023.
[2] Saito, Yuta, Qingyang Ren, and Thorsten Joachims. "Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling." arXiv preprint arXiv:2305.08062 (2023).
[3] Nicolò Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, and Paolo Cremonesi. Off-policy evaluation with deficient support using side information. Advances in Neural Information Processing Systems, 2022.
Official Review of Submission1159 by Reviewer EkM3
Official ReviewReviewer EkM320 Nov 2023, 19:13 (modified: 13 Dec 2023, 19:33)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer EkM3, AuthorsRevisions
Review:
This paper tries to address the challenge of off-policy evaluation, crucial for assessing and optimizing new policies. The primary issue tackled is the distribution shift between the logging policy (data generation) and the target policy (evaluation). The common approach of importance sampling, while providing unbiased estimates, often introduces high variance, especially in cases of large action spaces. To overcome this, the authors propose the Policy Convolution (PC) estimator, which strategically convolves the logging and target policies using action embeddings, introducing a bias-variance trade-off that can be controlled. However, several questions remain about this paper; see the Questions below for details.
Questions:
This article tends to focus more on pure reinforcement learning and is not strongly connected to the "User Modeling and Recommendation" track. Although the introduction briefly mentions the work and its relation to the web and recommendation systems (only in one sentence), I feel that the content of the entire article rarely delves into the aspects of recommendation systems. At least, that's the impression I get from the current writing style of the article. Therefore, I am inclined to suggest that this paper might find a better fit in other conferences, such as ICML, NeurIPS, AAAI, etc., rather than WWW.
The datasets used in this article lack persuasiveness, including the use of synthetic data and a dataset that is not very recent. If the authors aim to demonstrate the strong performance of the proposed work, it would be advisable to conduct experiments on more up-to-date real-world data, preferably directly related to recommendation systems.
It's currently unclear how the proposed policy convolution is directly related to performance improvement. At least, the analysis provided so far doesn't seem comprehensive enough.
Ethics Review Flag: No
Ethics Review Description: No
Scope: 2: The connection to the Web is incidental, e.g., use of Web data or API
Novelty: 4
Technical Quality: 4
Reviewer Confidence: 3: The reviewer is confident but not certain that the evaluation is correct
Response to reviewer EkM3
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:41Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
Q1: This article tends to focus more on pure reinforcement learning and is not strongly connected to the "User Modeling and Recommendation" track.
A1: We would like to strongly disagree here—offline policy evaluation is a crucial problem in recommender systems (e.g., offline A/B test evaluation), which is the primary theme for this track. There is a longstanding line of (highly cited) research (see the references in the paper, e.g., on line 87) on its applications in recommender systems, including ones [1, 2, 3, 4] published at the Web Conference (along with other data mining and recommender systems conferences). From (1) our motivating examples in the introduction; (2) Figure 1; and (3) our empirical evaluation settings (including the movielens dataset); we write the paper with recommender systems as the key application in mind. All in all, we strongly believe that our submission is of core relevance to The Web Conference.
Q2. The datasets used in this article lack persuasiveness.
A2: The goal of our experiments is to test the internal and external validity of PC. We test the internal validity of PC on the self-constructed dataset (as is the case with almost all existing OPE papers), where we can explicitly control the off-policy evaluation setup. We test the external validity of PC on the MovieLens dataset, where we try to mimic different real-world OPE scenarios (e.g., with embeddings derived from data), encompassing different (1) reward functions, (2) numbers of actions, (3) amounts of bandit data, (4) policy mismatches, and (5) action embedding sizes. Please see Figures 2 to 31 to appreciate the complexity of our empirical evaluation.
To keep the scale of the empirical evaluation manageable, we choose to vary different OPE scenarios on one of the most representative recommendation datasets, MovieLens, rather than on multiple datasets in only specific settings.
For dataset selection, as far as we know, there is no good open, real-world dataset for off-policy evaluation, and prior works also resort to simulating bandit feedback from full-information datasets. We would argue that MovieLens is one of the most representative datasets in recommender systems. Prior works either use classification datasets, which may be even older and are less reflective of real-world OPE problems in recommender systems, or use only a single dataset with comparable or fewer OPE scenarios.
Q3: Unclear how the proposed policy convolution is directly related to performance improvement.
A3: We would like to strongly disagree here—we compare different OPE methods in terms of mean squared error (MSE), the primary metric of interest in the field of off-policy evaluation, on two datasets with varying settings (reward functions, numbers of actions, amount of bandit data, policy mismatch, action embedding size, etc.). We show that OPE with PC consistently achieves smaller MSE compared to ones without PC. We believe these results demonstrate PC’s performance improvement. It would be great if the reviewer could elaborate more on this point to help us understand why they think the current empirical evaluation does not show that PC is directly related to performance improvement and the analysis does not seem comprehensive.
[1] Jie Peng, Hao Zou, Jiashuo Liu, Shaoming Li, Yibao Jiang, Jian Pei, and Peng Cui. Offline policy evaluation in large action spaces via outcome-oriented action grouping. In Proceedings of the ACM Web Conference 2023.
[2] Ma, Jiaqi, et al. "Off-policy learning in two-stage recommender systems." Proceedings of The Web Conference 2020. 2020.
[3] Wang, Xiangmeng, et al. "Off-policy learning over heterogeneous information for recommendation." Proceedings of the ACM Web Conference 2022. 2022.
[4] Sohoney, Saurabh, Nikita Prabhu, and Vineet Chaoji. "Handling Confounding for Realistic Off-Policy Evaluation." Companion Proceedings of The Web Conference 2018. 2018.
Official Comment by Reviewer EkM3
Official CommentReviewer EkM313 Dec 2023, 19:33Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
Thanks for your reply. After reading your rebuttal, I tend to raise my score.
Official Review of Submission1159 by Reviewer JoEa
Official ReviewReviewer JoEa02 Nov 2023, 05:52 (modified: 01 Dec 2023, 22:27)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer JoEa, AuthorsRevisions
Review:
The authors provide a comprehensive review of inverse propensity score estimators for off-policy evaluation and propose a new convolution-based estimator with greatly reduced variance, as demonstrated in simulations and in experiments in the realistic situation where items have embeddings that can be used to identify similar actions.
In general I find the presentation to be very well done, the idea novel, and the experimental evaluation convincing.
A great many papers proposing IPS estimation improvements are based on the observation that the ratio pi/mu can be very high, and propose modifications to this ratio. I think this is a rather limited view of the problem, and I am pleased that the authors of this paper introduce a different idea – that some actions are similar, and we might reasonably expect the rewards of these actions to be similar. This adds bias to the estimation process but greatly reduces variance, and I find the arguments that it can be very beneficial compelling.
The core of this idea is to replace the ratio pi(a|x)/mu(a|x) with a new ratio pi’(a|x)/mu’(a|x), where pi’ and mu’ are new policies created by convolving the old policies. The convolution is done with the benefit of additional information in the form of action embeddings. Imagine a1 and a2 are similar actions: the logging policy puts most of the mass on a1 and almost no mass on a2, while the new policy we would like to evaluate puts almost no mass on a1 and most of the mass on a2. With traditional IPS this will result in very high importance weights, as 1/mu(a2|x) is very high. The policy convolution estimator instead performs a convolution, which brings pi and mu closer together and therefore reduces variance. There are differences between the two convolutions, however. A convolution on pi has the interpretation: instead of evaluating a pi that is really different from the logging policy, let us evaluate a similar policy pi’. In contrast, a convolution on mu has the interpretation: as a1 and a2 are similar, we can use reward estimated on a1 to learn about a2 and vice versa. It would seem to me that convolving or modifying mu would only be beneficial in cases where it made mu closer to pi. At the end of the paper the authors compare the method with the marginal IPS estimator (Saito and Joachims 2022). They make the correct observation that the sampling assumptions for the policy underlying this estimator are not realistic in practice. Would the authors agree that convolved policy estimation is very similar to marginal IPS estimation, but recommends the use of the method in much broader circumstances?
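The mechanism described above can be checked with a small simulation (entirely my own construction, not the paper's setup): a context-free bandit whose logging policy concentrates on one end of a 1-D embedding and whose target policy concentrates on the other. Convolving both policies with a Gaussian kernel over the embedding tames the largest importance weights:

```python
import numpy as np

A = 50
emb = np.linspace(0.0, 1.0, A)                  # 1-D action embeddings

mu = np.exp(-5 * emb);        mu /= mu.sum()    # logging policy: mass on low end
pi = np.exp(-5 * (1 - emb));  pi /= pi.sum()    # target policy: mass on high end

# Gaussian kernel over embedding distances, row-normalized.
K = np.exp(-((emb[:, None] - emb[None, :]) ** 2) / 0.01)
K /= K.sum(axis=1, keepdims=True)
pi_c, mu_c = pi @ K, mu @ K                     # convolved policies

w = pi / mu                                     # raw IPS importance weights
w_c = pi_c / mu_c                               # convolved weights are tamer
```

The maximum of `w_c` is strictly smaller than the maximum of `w`, which is the variance reduction (bought at the price of bias) that the paper's bias-variance argument describes.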
A paper that the authors may have overlooked is Sakhi et al (2020). Like the convolved estimator, it uses action-action similarities, but it additionally employs context-context similarities. This is done in a direct method formulation via a prior distribution, allowing the same treatment for both action-action and context-context similarities. It would seem that an IPS-based formulation must develop different methodologies for treating action-action and context-context similarities, as the estimator involves ratios pi(a|x)/mu(a|x) in which the action and the context play different roles; in contrast, direct methods regress the reward off the action and the context. I don't think it is necessary to do further experiments, but it seems plausible that Sakhi et al (2020)'s method may be competitive with convolved IPS, because in addition to using action-action distances it can also use context-context distances. Furthermore, it does not violate the conditionality principle (Berger and Wolpert 1988).
As a final remark, I particularly like the way the authors are careful to highlight the assumption of unconfoundedness. This assumption will typically be satisfied in production environments if all sub-models involved in decision making have access to the complete contextual features, but in practice this is often not the case.
Questions:
On line 445 the covariance is given a normal distribution – this looks like an error (covariances are usually matrices/scalars, and variances must be positive). There are a few issues I would be interested in the authors' commentary on.
• The whole formulation of off-policy estimation concerns the reward averaged over the contexts – but in order to find the optimal policy this is neither necessary nor sufficient; instead we need only the reward estimator as produced by the direct method. Why then should we focus on correct estimation of "V"? Should users of the direct method be concerned by the results of Robins & Ritov (1997)?
• The authors demonstrate their method with simulated data and experiments based on MovieLens. MovieLens is not really a bandit problem, and it is a bit artificial to try to turn it into one. What is the value of a MovieLens based experiment – except that some referees may like/demand them?
• What are the differences between convolving pi and mu? Should they be done in the same way?
• How can context-context distances be brought into the proposed method? Any commentary on how the method may compare with Sakhi et al (2020)?
• Any commentary on the discussion above.
• The sizes of the action spaces investigated here are still far below what industrial-scale recommendation systems deal with. Is it feasible to apply these methods when action spaces are in the millions or even higher?
Ethics Review Flag: No
Ethics Review Description: na
Scope: 4: The work is relevant to the Web and to the track, and is of broad interest to the community
Novelty: 6
Technical Quality: 6
Reviewer Confidence: 4: The reviewer is certain that the evaluation is correct and very familiar with the relevant literature
Response to reviewer JoEa [Part 1/2]
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:43Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
We really thank you for your time and effort in reviewing the paper and providing us with your valuable feedback!
First of all, we really appreciate your high-level summary and takeaways in the first few paragraphs of your review. You are absolutely right in your summary, and you describe a very relevant off-policy scenario in which policy convolution will help (one that we could not include in our paper due to space constraints!)
Below are our responses to your questions; please let us know if we can add to or expand on anything:
Q1: Would the authors agree that convoluted policy estimation is very similar to marginal IPS estimation – but recommends the use of the method in much broader circumstances?
A1: Not really; MIPS [1] actually degenerates to naive IPS when there is a one-to-one mapping between each action and its embedding, as in most practical settings and in this paper. This differs from the setting used in the original MIPS paper, where each action was accompanied by a stochastic distribution over embeddings.
Q2: On line 445 the covariance denoted is given a normal distribution
A2: Thanks for spotting this! We’ll fix this in our next revision.
Q3: Why should we focus on the correct estimation of "V"?
A3: Good question! This is the difference between off-policy evaluation and off-policy learning. Regarding the need for off-policy evaluation, a variety of applications demand exact value-estimation of a given policy: for example, (1) the predicted mortality rate of a new treatment (policy) based on data collected from a different treatment; (2) the predicted business utility (e.g., clicks) of a recommender system, using data collected from a different recommender system; (3) resource allocation and optimization; etc.
Regarding the need for importance-weighting-based estimators compared to direct (reward-model-based) estimators: in practice, the learned reward model is rarely perfect, due to model misspecification and a finite amount of data. Thus, using the direct method to estimate the value of a policy typically suffers from a large (and unmeasurable) bias. Even in off-policy learning (optimization), there is evidence that policy learning using the direct-method estimator performs worse in many settings than unbiased importance-weighting estimators (e.g., IPS).
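The contrast above can be seen in a small simulation (a hypothetical two-action setup of our own construction, with a deliberately misspecified reward model standing in for model misspecification in general):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-action bandit: reward depends on the action only.
mu = np.array([0.7, 0.3])        # logging policy
pi = np.array([0.1, 0.9])        # target policy to be evaluated
true_r = np.array([0.2, 0.8])    # true expected rewards per action

n = 20000
a = rng.choice(2, size=n, p=mu)  # logged actions drawn under mu
r = rng.binomial(1, true_r[a])   # logged binary rewards

# Direct method with a misspecified reward model: a single constant
# (ignores the action entirely), fit on the logged data.
r_hat = np.full(2, r.mean())
v_dm = pi @ r_hat                # biased toward the logging-policy mixture

# IPS re-weights logged rewards; unbiased without any reward model.
v_ips = np.mean(pi[a] / mu[a] * r)

v_true = pi @ true_r
print("true:", v_true, "DM:", v_dm, "IPS:", v_ips)
```

The direct-method estimate stays stuck near the logging policy's average reward no matter how much data is collected, while IPS converges to the target policy's true value; the price IPS pays is higher variance, which is exactly the trade-off policy convolution targets.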
Q4: What is the value of a MovieLens based experiment?
A4: Recommender systems are very closely tied to off-policy evaluation. A classical problem in recommender systems is evaluation: how to accurately estimate the "value" of a recommender system, since the previously logged data was collected using a different production recommender system, introducing a vast amount of selection bias. Hence, a wide body of literature has focused on the off-policy evaluation problem in the context of recommender systems [2, 3]. To this end, we constructed off-policy evaluation instances from the MovieLens dataset, which is arguably one of the most representative datasets in recommender systems.
Q5. What are the differences between convolving pi and mu? Should they be done in the same way?
A5: Great question! We answer the question from two angles:
From an intuitive perspective, in addition to reducing the variance in off-policy estimation, as you yourself suggested in your review, an added practical benefit of convolving pi and mu is robustness to deficient support, i.e., when pi and/or mu do not put probability mass on all actions (see [4] for a deeper discussion). Deficient support adds irrecoverable bias to OPE, but our policy convolution framework is able to recover from this bias (see Figure 5) through the added information available to us via action embeddings.
From a more empirical perspective, e.g., in Figure 6, we observe that allowing PC to convolve pi and mu individually (i.e., with different levels of convolution) yields much better estimation quality than other combinations, such as convolving only pi, convolving only mu, or convolving both equally.
Response to reviewer JoEa [Part 2/2]
Official CommentAuthors (Dawen Liang, Julian McAuley, Nathan Kallus, Noveen Sachdeva, +1 more)10 Dec 2023, 13:44Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors
Comment:
Q6. How can context-context distances be brought into the proposed method? Any commentary on how the method may compare with Sakhi et al (2020)?
A6: Great suggestion! Context similarity could potentially be very helpful, e.g., in recommender systems, similar users (contexts) might like similar things. In this work, we focus only on action similarity, but incorporating context similarity might certainly help off-policy evaluation, and would be an interesting future research direction.
And thank you for bringing Sakhi et al (2020) to our attention; we will make sure to include it in our paper's next revision! Regarding its comparison with PC: the two are quite different ideas. BLOB performs off-policy learning from a direct-method, Bayesian perspective, whereas PC performs off-policy evaluation using importance sampling. Both techniques share the same spirit of using action-action (and, in BLOB, context-context) similarities to perform their respective tasks more efficiently and accurately.
Q7. Is it feasible to apply these methods when action spaces are in the millions or even higher?
A7: Great question! We first note that an action space of 10K actions is already much larger than in prior works. We believe that the PC family of estimators has the potential to scale to much larger recommender systems with millions of items, since (1) some of our convolution functions are hierarchical and can elegantly handle much larger action spaces; (2) we will most likely have access to more data in larger-scale recommender systems; and (3) as seen in Figure 2, the relative gap between the naive baselines and the PC family of estimators seems to increase with larger action spaces. We agree that it would be interesting to test how the method works in a real-world large-scale recommender system, and we leave this as a direction for future work.
Finally, we would like to end by kindly requesting that you reconsider your final ratings, as they strictly decide the outcome of this paper, which, by your own assessment, is novel, well-motivated, and contains strong empirical evidence.
[1] Saito, Yuta, and Thorsten Joachims. "Off-Policy Evaluation for Large Action Spaces via Embeddings." International Conference on Machine Learning. PMLR, 2022.
[2] Ma, Jiaqi, et al. "Off-policy learning in two-stage recommender systems." Proceedings of The Web Conference 2020. 2020.
[3] Narita, Yusuke, Shota Yasui, and Kohei Yata. "Debiased off-policy evaluation for recommendation systems." Proceedings of the 15th ACM Conference on Recommender Systems. 2021.
[4] Sachdeva, Noveen, Yi Su, and Thorsten Joachims. "Off-policy bandits with deficient support." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.