Official Review of Paper286 by Reviewer qS9G
KDD 2023 Conference Research Track, Reviewer qS9G, 03 Apr 2023

Summary:
This paper proposes an approach for reranking items in recommendation systems. The reranking takes into consideration all the items in the list and generates the reordered items one at a time, conditioned on the previous items. The model is based on a Transformer [similar to 24] and the loss is based on flow networks [4]. In that sense it is a bit incremental (applying an existing training loss to an existing architecture for a known problem). I also found the description of flow networks hard to follow, and I feel that its presentation could improve. The authors also point out that the proposed approach requires more tuning of hyperparameters compared to existing approaches, so it seems to be more expensive. Overall, I found the approach interesting (albeit fairly incremental), but I feel that the presentation needs to improve. See detailed comments below.

Paper Strength:
List-wise ranking algorithms are important and interesting, so it is good to see additions to the arsenal of methods in this area.

Paper Weakness:
The paper seems somewhat incremental: it applies an existing training procedure (from [4]) to an existing architecture (from [24]) for a known problem (list-wise ranking).
The quality of presentation could improve: the explanation of flow networks seems to require prior knowledge of the approach. It would be good to explain it without assuming prior knowledge.
The experiments show that different variants of the proposed approach achieve better results on different metrics (e.g., diversity vs. reward), so it is not clear which variant one would want to use in practice.

Questions To Authors And Suggestions For Rebuttal:
Comments:
Section 3.2 was not so clear to me. Perhaps you could start by describing the flow network idea more generally and after that say how it is applied here. This is important because the main contribution of the paper is applying flow networks to item reranking.
The ILD metric should be high for a policy that selects random slates. How would that compare to the other algorithms in terms of ILD?
The results in Table 2 seem to show that GFN performs well in terms of reward while GFN(Explore) performs well in terms of diversity, as expected. It is suggested that GFN(Explore) offers a better trade-off since it also has competitive performance in rewards. So do you recommend using that in practice? Please be clear about that, since those are two different policies. A similar question arises for TB vs. DB: the latter seems better in terms of diversity while the former is better for rewards. So which gives the best trade-off? Perhaps it would be helpful to show this in 2D (reward by diversity), but I am not sure.
Is there a conclusion regarding the preference for TB vs. DB? When is it better to use one over the other? This was not clear from the experiments.
Section 4.3.2: it is mentioned that GFN_DB has larger error variance compared to GFN_TB (Figures 5 and 6, left). It also looks like it has lower coverage variance compared to GFN_TB (Figures 5 and 6, right). Is this also expected?

Minor / Typos:
- 142: defines => define
- 165: either "we define" or "is defined", remove the other
- 381: state that TB stands for "trajectory balance" (and DB for "detailed balance"). The first time it is written out explicitly is only much later (Section 4.3.2).
- 488: requires => require
- 507: optimal => optimum
- 509: aligns => align
- 702: in Table 2, GFN_DB(Explore) should be bold for ML1M since 3.776 > 2.972
- 709: Table 3, ML1M, CF, ILD: missing underline
- 712: Table 3, KR1K, PRM, AvgR: remove underline
- 815: suffer => suffers
- 834: sample => samples
- 893: on => one
- 1010: I am guessing that RerankTopK is actually RerankCF in Table 3; make the naming consistent.

Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results.
Presentation: 2: Fair - The paper is generally understandable, but certain sections require further refinement.
Contributions: 2: Fair - The study may be of relevance within a specific research community.
Overall Assessment: 2: Borderline reject: The paper is technically solid but has limitations that outweigh the reasons to accept.
Confidence Level: 3: The reviewer is fairly confident that the evaluation is correct.

Rebuttal for Reviewer qS9G (Part I)
Paper286 Authors (Shuchang Liu), 13 Apr 2023

Comment:
Thanks for the detailed, constructive feedback. We address each concern as follows:

Q: The paper seems somewhat incremental: it applies an existing training procedure (from [4]) to an existing architecture (from [24]) for a known problem (list-wise ranking).
A: Though not directly listed in the contributions, we described the following insights from GFlowNet as the main motivation of our paper: 1) the log-scale reward, and 2) the auto-regressive generation based on future value. While the first point generates a new problem formulation in a very general way, the second point aims to combine the idea of RL with the idea of sequence generation. These two points are not unique to GFlowNet but are relatively new to the domain of list-wise recommendation. In addition, our method makes several other technical efforts to accommodate the original GFlowNet [4] to the list-wise recommendation problem: (1) a user-based initial state; note that the initial state in the original GFlowNet [4] is usually the same for all samples and not personalized as in list-wise recommendation; (2) [4] also emphasizes the problem of path aggregation in a DAG, yet our generative framework adopts a simple and effective generation tree which avoids this problem; (3) the action space in list-wise recommendation is much larger (3706 or 69219 items) than that in [4] (around hundreds of actions), which is usually considered one of the main challenges of applying RL-based solutions in recommender systems. On the other hand, PRM [24] adopts a reranking strategy in a two-stage framework while our GFN4Rec solution is a single-stage method. Still, investigating GFN4Rec's performance in a two-stage framework could be a potential extension of our work.

Q: The experiments show that different variants of the proposed approach achieve better results on different metrics (e.g., diversity vs. reward), so it is not clear which variant one would want to use in practice...
A: TB vs. DB: So far we do not observe a consistent performance gap between TB and DB. Theoretically, DB only observes the real label at the last generation step, so its loss has a larger bias though its variance may be lower. In contrast, the TB loss combines the multiple steps of one instance, so it may induce larger variance. Therefore, in our setting where the list size is fixed, DB is potentially more suitable for larger action spaces (item candidate sets) and larger-scale problems, which might explain the better performance of GFN_DB on KR1K when trained with offline data (Table 3). In general, their behavior could depend on the data characteristics. Note that when engaging online learning, this might be a different story, since exploration effectiveness and efficiency also determine the final performance. Table 1 may indicate GFN_TB > GFN_DB in rewards, but there may be a trade-off with the diversity metrics, so we remain skeptical of the superiority of TB over DB. Nevertheless, we agree that this is a very important point that requires further investigation in the future. Additionally, [24] mentions that TB is better at learning and generating longer sequences, but this is not verifiable in our case since we assume a fixed list size for each generation process. Still, the boosted learning curve of TB in the early stage can be seen in Figures 5 and 6, which is consistent with the observation of [24]. We will add some of these discussions to Section 4.3.2.
Greedy vs. Explore: This is closely related to the exploration-exploitation trade-off in RL. In practice, one can control the ratio of exploration when using GFN4Rec. In other words, the system can manually choose which strategy to use when generating actions at inference time, and which strategy pushes its data into the training buffer. Investigating this control could be a very important step toward an industrial-level solution. In our work, we only aim to show the effectiveness of GFN4Rec in finding diverse and high-quality recommendations when employing exploration. The exploration variants sometimes surpass the baselines, and the greedy variant always finds better rewards than the other baselines in our experiments.

Q: The ILD metric should be high for a policy that selects random slates. How would that compare to the other algorithms in terms of ILD?
A: You can expect a random policy to give a much higher ILD while its accuracy and rewards are much lower. One way to observe this is to look at the starting performance of the different models. For example, in Figure 7 we can see that ILD is around 0.65 at the beginning (closest to random, with only the model's initial bias) and drops to around 0.55 after 5000 iterations. In contrast, the reward is around 1.8 at the beginning and increases to 2.4 after training. In our observation, offline learning of ListCVAE achieves a result close to this near-random starting point, indicating unstable and incomparable recommendation performance.

Continued Rebuttal for Reviewer qS9G (Part II)
Paper286 Authors (Shuchang Liu), 13 Apr 2023

Comment:
Q: Section 4.3.2: it is mentioned that GFN_DB has larger error variance compared to GFN_TB (Figures 5 and 6, left). It also looks like it has lower coverage variance compared to GFN_TB (Figures 5 and 6, right). Is this also expected?
A: This is a very insightful question, and coverage variance has rarely been discussed in the field of recommendation.
Following this direction, we may notice that GFN_TB shows a slower drop in coverage, and this could be a major reason for its larger coverage variance during training. However, its reward grows faster and its reward variance is lower. In this sense, GFN_TB might be a better choice in terms of learning stability. According to our experiments, there is still no guaranteed better accuracy-diversity trade-off for GFN_TB over GFN_DB or the other way around.

Q: The quality of presentation could improve: the explanation of flow networks seems to require prior knowledge of the approach. It would be good to explain it without assuming prior knowledge. (And in the question section:) Section 3.2 was not so clear to me. Perhaps you could start by describing the flow network idea more generally and after that say how it is applied here. This is important because the main contribution of the paper is applying flow networks to item reranking.
A: Indeed, GFlowNet introduces a lot of terminology into the list-wise recommendation problem formulation. We will add a brief background description to give readers a head start on GFlowNet and point out that the intuition behind it is purely probabilistic inference. In our case, for example, we regard "quality of the list = flow of the state = probability of generating the list" as identical in their semantics. We will also add a reminder to readers that the list recommendation problem in our setting is NOT a reranking solution and there is no assumption of an existing initial ranker. The PRM baseline is mainly used to verify that GFN can potentially achieve the same or better performance than a reranking model.

Q: Minor / Typos
A: We appreciate the effort in pointing out these typos; all will be fixed.

Official Review of Paper286 by Reviewer XrfJ
KDD 2023 Conference Research Track, Reviewer XrfJ, 02 Apr 2023

Summary:
The paper proposes a list-wise recommendation framework (GFN4Rec) that uses a generative flow network to represent the probabilistic list generation process. The proposed framework consists of two main parts: a flow estimator that models the probability distribution of user-item interactions, and a reward estimator that predicts the expected reward of each item in the list. To validate the effectiveness of the proposed method, the authors perform both offline and online experiments on two real-world datasets. The results show that the proposed method outperforms several state-of-the-art methods on the selected benchmark metrics. All this evidence shows that GFN4Rec is able to find high-quality recommendations with better diversity as an online learning framework. Overall, this paper presents a promising approach for improving list-wise recommendation and provides valuable insights into how different model components and hyperparameters impact performance.

Paper Strength:
One of the key strengths of the paper is that the proposed model has the ability to model the items' mutual influence in a list-wise recommendation problem, which can improve recommendation quality by modeling intra-list correlations of items that are exposed together.
The paper also discusses the theoretical reasoning behind this behavior in the section "Reward vs. Log-scale Reward", in which case items with high scores are less distinguishable than those with lower scores, so items with lower scores also have a good chance of being selected, which leads to better diversity of recommendations. In addition, the ablation study in this paper is well designed and provides valuable insights into the performance of the proposed method. Specifically, the authors compare greedy vs. exploration and discuss the feasibility of the online simulator, etc. Overall, the ablation study provides strong evidence for the effectiveness of the proposed method and highlights its strengths relative to other approaches.

Paper Weakness:
The authors mentioned that GFN4Rec has more hyperparameters, which require more empirical effort to find a feasible optimization setting than standard supervised learning approaches. It might also require a large amount of training data to achieve optimal performance, which may not be feasible in some real-world applications. With these newly introduced hyperparameters, it might be helpful to conduct more extensive sensitivity analyses to evaluate the robustness of the results to different hyperparameters or model configurations. The authors could also consider providing more detailed explanations or visualizations of how different components or features (profile features and recent history, separately) contribute to the overall performance.

Questions To Authors And Suggestions For Rebuttal:
In the discussion section, can you please consider more extensive sensitivity analyses to evaluate the robustness of the results to different hyperparameters or model configurations? The paper introduces a few more hyperparameters and mentions that it might take more empirical effort to find an optimized configuration. As a reader, I am curious how the model behavior varies with different sets of hyperparameters.

Technical Quality: 4: Excellent - The approach is well-justified and all claims are convincingly supported.
Presentation: 4: Excellent - The paper is well written, making it a delightful read with a clear and easy-to-follow structure.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 4: Accept: The paper is technically solid and has a high impact on at least one sub-area.
Confidence Level: 3: The reviewer is fairly confident that the evaluation is correct.

Rebuttal for Reviewer XrfJ
Paper286 Authors (Shuchang Liu), 14 Apr 2023

Comment:
Thanks for the constructive feedback. Here are our answers to the concerns:

Q: The authors mentioned that GFN4Rec has more hyperparameters, which require more empirical effort to find a feasible optimization setting than standard supervised learning approaches. With these newly introduced hyperparameters, it might be helpful to conduct more extensive sensitivity analyses to evaluate the robustness of the results to different hyperparameters or model configurations. In the discussion section, can you please consider more extensive sensitivity analyses to evaluate the robustness of the results to different hyperparameters or model configurations? The paper introduces a few more hyperparameters and ...
A: For the sensitivity test, we found that the sensitivity of the hyperparameters may change with the environment and dataset. For example, GFN_TB is not very sensitive to b_f and GFN_DB is not very sensitive to b_r near the observed best setting on KR1K. In almost all experiments, we found that b_z=1.0 consistently gives good results, so it is fixed when searching b_f and b_r. Due to limited space, we will only add conclusions on the sensitivity to the paper, while the detailed comparison has to be presented through additional materials. Thus, we will include the corresponding running scripts with the best settings in the source code upon release if published. To give an early view, here we present some of the tests on KR1K:

GFN_DB, sensitivity of b_f (given b_r=0.4, b_z=1.0):
  b_f        0.1      0.3      0.5      0.7      0.9
  Avg R      1.810    2.214    2.107    2.039    2.015
  Max R      3.597    3.950    3.915    3.865    3.849
  Coverage   24.310   35.263   398.435  515.572  496.427
  ILD        0.546    0.553    0.597    0.647    0.661

GFN_DB, sensitivity of b_r (given b_f=0.3, b_z=1.0):
  b_r        0.2      0.4      0.6      0.8      1.0
  Avg R      2.206    2.212    2.254    2.266    2.233
  Max R      3.955    3.967    3.990    3.992    4.003
  Coverage   35.871   35.957   35.737   35.063
  ILD        0.583    0.621    0.569    0.564    0.577

Best setting for GFN_DB: b_r=0.8, b_f=0.3-0.5.

GFN_TB, sensitivity of b_f (given b_r=0.5, b_z=1.0):
  b_f        0.1      0.3      0.5      0.7      0.9
  Avg R      2.379    2.359    2.268    2.371    2.367
  Max R      4.083    4.066    4.050    4.058    4.042
  Coverage   18.976   22.161   49.588   16.894   16.800
  ILD        0.554    0.525    0.532    0.565    0.531

GFN_TB, sensitivity of b_r (given b_f=1.0, b_z=1.0):
  b_r        0.1      0.3      0.5      0.7      0.9
  Avg R      2.414    2.401    2.374    2.384    2.377
  Max R      4.054    4.053    4.040    4.042    4.048
  Coverage   21.267   19.082   18.839   18.212
  ILD        0.520    0.522    0.540    0.523    0.522

Best setting for GFN_TB: b_r=0.1, b_f=1.0.

In practice, we suggest that readers adopt an iterative line-search strategy that searches one parameter at a time, fixes the best point, and then searches the next; it usually converges in one or two rounds and avoids the full tensor-product search space.

Q: The authors could also consider providing more detailed explanations or visualizations of how different components or features (profile features and recent history, separately) contribute to the overall performance.
A: This could be a very practical viewpoint for future real-world deployment since it involves feature engineering and analysis. In our paper, we focus only on the verification of the overall learning framework and the possible improvements GFN may bring to list-wise recommendation. Thus, in the experiments, we keep the profile and history encoders the same for all models (both GFN4Rec and the baselines).

Q: It might also require a large amount of training data to achieve optimal performance, which may not be feasible in some real-world applications.
A: In its training procedure, we find that GFN4Rec is closer to traditional supervised learning with auto-regressive models and far from standard RL-based methods. Algorithmically, each sample of GFN covers K (i.e., the list size) items, which corresponds to K samples in supervised learning, and both the DB loss and the TB loss adopt a standard MSE loss in essence.
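To make this concrete, the following is a minimal sketch of the two objectives for a fixed-length list of size K, assuming a tree-structured generation process so that the backward probabilities drop out of the balance conditions. This is illustrative PyTorch-style code, not the released implementation: the tensor names are ours, and the weighting coefficients (b_z, b_f, b_r) discussed above are omitted.

```python
import torch

def tb_loss(log_z, log_pf, log_reward):
    # Trajectory balance: one squared residual per generated list.
    # log_z:      (B,)   learned log initial flow (per-user partition term)
    # log_pf:     (B, K) log P(a_t | s_{t-1}) for each of the K generation steps
    # log_reward: (B,)   log-scale reward of the whole list
    residual = log_z + log_pf.sum(dim=1) - log_reward
    return (residual ** 2).mean()

def db_loss(log_flow, log_pf, log_reward):
    # Detailed balance: one squared residual per generation step.
    # log_flow:   (B, K+1) learned log state flows for s_0, ..., s_K
    # log_pf:     (B, K)   log P(a_t | s_{t-1})
    # log_reward: (B,)     log-scale reward of the whole list
    # Only the terminal flow is tied to the observed reward; intermediate
    # steps are matched against the model's own flow estimates.
    log_flow = torch.cat([log_flow[:, :-1], log_reward.unsqueeze(1)], dim=1)
    residual = log_flow[:, :-1] + log_pf - log_flow[:, 1:]
    return (residual ** 2).mean()
```

Both reduce to mean-squared residuals. The TB residual couples all K steps of a list into a single term tied to the observed reward, while only the final DB term touches the reward, which matches the bias/variance discussion in our rebuttal to Reviewer qS9G.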
In the empirical study, as described in Section 4.2 "Offline Learning", we train all models, including GFN4Rec, for the same number of steps (5000 steps, 128 training samples per step), and they converge at approximately the same rate. In online learning, we observed that the supervised CF model even converges more slowly (sometimes requiring 8000 steps) than GFN on ML1M, which may be related to their exploration ability. Yet this might not be a general phenomenon, so it is not further discussed in our paper. Indeed, the problem of sample efficiency is critical to all online learning problems, but in general you can expect training GFN to be close to supervised online learning in terms of efficiency.

Official Review of Paper286 by Reviewer qAbj
KDD 2023 Conference Research Track, Reviewer qAbj, 02 Apr 2023

Summary:
Personalized recommender systems fulfill the daily demands of customers and boost online business. While most existing methods learn a point-wise scoring model that predicts the ranking score of each individual item, recent research shows that a list-wise approach can further improve the recommendation quality by modeling the intra-list correlations of items that are exposed together. However, it is challenging to search the large combinatorial space of list actions, and existing methods that use a cross-entropy loss may suffer from low-diversity issues. In this paper, the authors propose GFN4Rec, a generative method that takes the insights of the flow network to ensure the alignment between list generation probability and its reward. Lastly, the authors conduct experiments on simulated online environments as well as an offline evaluation framework for two real-world datasets.

Paper Strength:
The paper has a good motivation. The proposed method is well written. The paper has a reasonably good related-work section. The paper reports thorough experiments on two real-world datasets.

Paper Weakness:
The datasets have been heavily processed before the experiments. It would be good to see larger datasets. It would be good to see how the proposed method can be deployed in online experiments.

Questions To Authors And Suggestions For Rebuttal:
N/A

Technical Quality: 2: Fair - The proposed approach appears reasonable; however, certain fundamental assertions lack sufficient justification.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 4: The reviewer is confident but not absolutely certain that the evaluation is correct.

Rebuttal for Reviewer qAbj
Paper286 Authors (Shuchang Liu), 13 Apr 2023

Comment:
Thanks for the constructive feedback. Here are our answers to the concerns:

Q: The datasets have been heavily processed before the experiments. It would be good to see larger datasets.
A: We believe the reviewer is referring to typical reranking tasks where the candidate set is on a much larger scale, while our setting considers a standard one-stage list-wise recommendation problem like [13, 18, 19, 28], where the datasets are of a similar scale to ours. Additionally, the original GFN paper was not proposed for such an extremely large-scale problem and only conducted experiments on a dataset with hundreds of actions, which is even smaller than our task. On the other hand, in industrial recommender systems the whole pipeline is usually multi-stage. While the early stages (such as recall and candidate generation) may have to deal with a large number of items and users, a later stage may only need to select dozens of items from an intermediate candidate set with hundreds of items, which is closer to the experimental setting in our paper. In general, when incorporating GFN into a large-scale recommender system, it is more suitable for later stages that conduct more fine-grained ranking than for early stages that require computational efficiency for heavy processing of items. However, it would be interesting to investigate whether GFN remains effective for an even larger action-state space. So far, we believe the selected datasets sufficiently support our claims on the two insights GFN provides: the log-scale rewards and the auto-regressive model that optimizes future rewards at each step. We will add a brief discussion of this confusion in the experimental setup section.

Q: It would be good to see how the proposed method can be deployed in online experiments.
A: Online experiments with an A/B test would be the most direct and accurate way to validate the effectiveness of a method, but this is not always available due to restricted industrial environments and resources. Thus, we adopt a simulator-based evaluation to mitigate the gap, similar to existing approaches [12]. To ensure the feasibility of the simulator, we use an additional offline training-evaluation process to further verify that the model behaviors are consistent. In essence, offline evaluation is commonly accepted as an indicator of model effectiveness in the actual online environment; if different models behave in the same way in both the offline evaluation and the online simulation, this supports the feasibility of the simulator.

Official Review of Paper286 by Reviewer U9uX
KDD 2023 Conference Research Track, Reviewer U9uX, 19 Mar 2023

Summary:
This paper proposes a technique, GFN4Rec, to learn a policy that can generate a sufficiently diverse item list for users in RS while maintaining high recommendation quality. The authors present experiment results on simulated online environments and an offline evaluation framework using two real-world datasets.

Paper Strength:
(1) This paper is easy to follow. The authors did a great job describing the motivation and related work.
(2) The Experiments section demonstrates how the proposed technique compares with selected baselines in both simulated online learning and offline learning.
(3) The explanation of "Reward vs. Log-scale Reward" in Section 3.4 is compelling.

Paper Weakness:
(1) As stated in line 15, the main reason that listwise recommendation is not yet widely adopted in real-world RS is that "it is challenging to search the large combinatorial space of list actions". It is not clear to me how this paper moves us closer to solving this challenging problem.
(2) Both of the datasets in Table 1 appear to have a relatively small number of users and a small number of items. These datasets don't seem to be good representative examples of real-world RS. I would suggest that the authors add some analysis and discussion about how GFN4Rec scales to a larger number of items.
(3) Line 267: "the item is randomly sampled based on the softmax selection score". Does this mean that GFN4Rec needs to compute the score for each item at every step? If so, would this be a very expensive operation, in particular for a large corpus of items?
(4) Line 689: "we observe both the test performance under greedy selection and that under categorical sampling". What is "categorical sampling"?
(5) In addition to the comparison presented in Tables 1-3, it would be helpful to see how the methods compare in both training and inference time.
(6) Line 165: "we define the listwise reward is defined as ..." doesn't make sense.

Questions To Authors And Suggestions For Rebuttal:
Please address the questions raised in "Paper Weakness" above.

Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 4: The reviewer is confident but not absolutely certain that the evaluation is correct.

Rebuttal for Reviewer U9uX (Part I)
Paper286 Authors (Shuchang Liu), 13 Apr 2023

Comment:
Thanks for the constructive feedback. Here are our answers to the concerns:

Q1: As stated in line 15, the main reason that listwise recommendation is not yet widely adopted in real-world RS is that "it is challenging to search the large combinatorial space of list actions". It is not clear to me how this paper moves us closer to solving this challenging problem.
A: Solutions to the large action space problem in list-wise recommendation consist of evaluator-based methods, generator-based methods, and the evaluator-generator framework. We regard our method as both evaluator-based and generator-based since it is essentially an energy-based model [17]. Thus, we do not consider this the unique contribution of our work, and we will remove this point from the abstract since it is causing confusion.

Q2: Both of the datasets in Table 1 appear to have a relatively small number of users and a small number of items. These datasets don't seem to be good representative examples of real-world RS. I would suggest that the authors add some analysis and discussion about how GFN4Rec scales to a larger number of items.
A: This is the same problem as Reviewer qAbj's first concern in "Weakness", so we duplicate the answer here: We believe the reviewer is referring to typical reranking tasks where the candidate set is on a much larger scale, while our setting considers a standard one-stage list-wise recommendation problem like [13, 18, 19, 28], where the datasets are of a similar scale to ours. Additionally, the original GFN paper was not proposed for such an extremely large-scale problem and only conducted experiments on a dataset with hundreds of actions, which is even smaller than our task. On the other hand, in industrial recommender systems the whole pipeline is usually multi-stage. While the early stages (such as recall and candidate generation) may have to deal with a large number of items and users, a later stage may only need to select dozens of items from an intermediate candidate set with hundreds of items, which is closer to the experimental setting in our paper. In general, when incorporating GFN into a large-scale recommender system, it is more suitable for later stages that conduct more fine-grained ranking than for early stages that require computational efficiency for heavy processing of items. However, it would be interesting to investigate whether GFN remains effective for an even larger action-state space. So far, we believe the selected datasets sufficiently support our claims on the two insights GFN provides: the log-scale rewards and the auto-regressive model that optimizes future rewards at each step. We will add a brief discussion of this confusion in the experimental setup section.

Q3: Line 267, "the item is randomly sampled based on the softmax selection score". Does this mean that GFN4Rec needs to compute the score for each item at every step? If so, would this be a very expensive operation, in particular for a large corpus of items?
A: For a list of size K, GFN4Rec runs the forward function K times (each time sampling an item according to the item probabilities), so its overall inference efficiency is identical to all existing auto-regressive solutions. A whole-list generator like ListCVAE only runs the forward function once, but still needs to sample an item K times. In our experiments, we do observe faster computation for the ListCVAE solutions, but they are not comparable in performance; additionally, both have to sample K times from the candidate set. We use GPUs for the computation, so the differences in running time become even smaller. For example, on KR1K, ListCVAE has a total inference+training time of around 4800 seconds for 5000 steps while GFN uses around 5500 seconds, and CF uses 3341 seconds but does not engage in any generation or sampling process. Borrowing the discussion from the previous question, all auto-regressive models used in recommender systems may induce high computational cost when generating a large output list, but they are better suited to later stages, especially the ranking and reranking stages where the candidate set and the output list are smaller. We will add the discussion of complexity to the experimental results.

Q4: Line 689, "we observe both the test performance under greedy selection and that under categorical sampling". What is "categorical sampling"?
A: Categorical sampling refers to the sampling at each auto-regressive step: the item is selected based on its probability, as mentioned in Section 3.1 ("generation tree"). We will also add a brief clarification to the experiment section of the paper to resolve this confusion.
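To make the generation procedure concrete, here is a minimal sketch of the K-step auto-regressive generation with categorical sampling described in the two answers above. This is illustrative Python/PyTorch pseudocode rather than the paper's implementation: model.step is a hypothetical forward function returning one score per candidate item given the user state and the items already placed, and masking already-chosen items is our assumption to keep the list free of duplicates.

```python
import torch

def generate_list(model, user_state, num_items, K, greedy=False):
    # K-step auto-regressive list generation: one forward pass per position.
    chosen = []
    mask = torch.zeros(num_items, dtype=torch.bool)         # items already placed
    for _ in range(K):
        scores = model.step(user_state, chosen)              # (num_items,) selection scores
        scores = scores.masked_fill(mask, float("-inf"))     # assumed: no duplicate items
        probs = torch.softmax(scores, dim=-1)                # softmax selection score
        if greedy:
            item = int(torch.argmax(probs))                  # greedy selection
        else:
            item = int(torch.multinomial(probs, 1))          # categorical sampling
        chosen.append(item)
        mask[item] = True
    return chosen
```

The greedy and exploratory variants differ only in the final selection line; the K forward passes and the per-step softmax over the candidate set are the same, which is why the inference cost is comparable to other auto-regressive solutions.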
Q: Line 165, "we define the listwise reward is defined as ..." doesn't make sense.
A: The typo will be fixed; it should read "we define the listwise reward as ...".

Continued Rebuttal for Reviewer U9uX (Part II)
Paper286 Authors (Shuchang Liu), 13 Apr 2023

Comment:
Q6: In addition to the comparison presented in Tables 1-3, it would be helpful to see how the methods compare in both training and inference time.
A: If the reviewer is referring to the computational complexity of the model, we redirect this to our answer to Q3 above. If the reviewer is referring to comparing performance at training time and at inference time, Table 3 is actually approaching this idea: the models are trained using offline data, and for evaluation, the online metrics show inference-time evaluation on the simulator, while the offline ranking metrics (metrics with the "(test)" suffix) show inference-time evaluation on the test datasets. Our online learning setting assumes continuous evaluation, which means training and evaluation happen all the time throughout the iterations. One way to view these results is to regard Table 3 as inference-time offline evaluation, and Tables 1 and 2 as both training-time and inference-time evaluation.