Comments to the Author:
The authors' effort in this revision is appreciated. When preparing the final version, the authors are encouraged to consider R#3's comments to strengthen Section 3.2 and the discussion of the connections between works.

Reviewer(s)' Comments to Author:

Reviewer: 1
Comments to the Author
The authors made great efforts to address my concerns and suggestions in the revision, and the current version is satisfactory. I do not have additional questions or queries.

Reviewer: 2
Comments to the Author
Thank you for the revisions. All my concerns have been well addressed. I recommend accepting this paper.

Reviewer: 3
Comments to the Author
I appreciate the authors' efforts in addressing my concerns. Although the organization of Section 3.2 still seems simple and the discussion of the connections between works is not strong, I think the paper is ready for publication now.

===== Major revision =====

Associate Editor comments:

Associate Editor Comments to the Author:
This paper provides a survey on offline reinforcement learning (RL) for recommendation systems. The authors review existing techniques in offline RL and RL4RS, as well as future directions. The topic is timely and interesting to the community. The reviewers raised several important concerns that are expected to be addressed through a revision.
- Section 3, which reviews current work, needs to be further structured for clearer illustration and categorisation.
- Although Section 2 gives a precise description of the RL setting, it should be more concise and more closely connected to recommender systems.
- There is a lack of discussion and differentiation between general RL4RS and offline RL4RS. The authors are suggested to give a better-motivated description.
- Some related work in offline RL4RS is missing.

Reviewer Comments:

Reviewer: 1
Comments to the Author
The paper addresses the challenges and opportunities of using offline reinforcement learning (RL) in recommender systems. Key challenges include data inefficiency in RL-based systems and limited research on offline RL for recommendations. There are also concerns about data quality and achieving explainability. On the positive side, offline RL presents opportunities to leverage existing datasets, improve recommendation quality, and optimize advertising strategies. The potential of offline RL to enhance A/B testing and increase click-through rates is also highlighted. I believe the challenges and opportunities are well summarized and will serve as a good guide for future researchers who want to delve into this domain. Overall, this survey is well written and well organized. It offers new researchers in the RLRS domain a comprehensive view of how reinforcement learning can be applied to recommender systems. Here are my comments (a sketch of the notation I have in mind follows this list):
1. The authors provide many RL formulas for beginners to start with. However, some corrections are needed. For example, in Eq. (15), I believe the term "a \sim \pi" should be a subscript of the expectation. Meanwhile, in Eq. (3), the term Q_u seems new here; does it refer to Q_\pi? If it really is Q_u, more justification is required.
2. In the problem formulation part of Section 2.5, the authors use U and I to denote the sets of users and items, respectively. It would be better to include braces around the elements of I and U, so that readers understand U and I are sets rather than sequences.
3. In the RL section, two different notations appear: Q_\pi and \hat{Q}_\pi. In a standard RL textbook, \hat{Q} might be read as the optimal value of Q, but given that there is no standard definition of \hat{Q} here, it may lead to confusion. Hence, further justification of the difference between \hat{Q} and Q is expected.
4. Figure 2 is good but needs some further work. For example, the captions of Fig. 2 (a), (b), and (c) should be capitalized. Moreover, the subscript 't' in s_t, r_t, a_t is too small, which makes it hard for readers to identify. It would be better if the authors adjusted the font size to improve readability.
5. In Eq. (10), the index seems to start from 1. Is there a specific reason for this? It is the first time the index starts from 1 in this survey. Going through the description and the previous content, I think the index should start from 0: there is no definition of s_1, but s_0 is defined by the initial state distribution.
6. What does d stand for? I assume it denotes a distribution, but since it is not defined, a short explanation should be provided to improve readability.
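For reference, a minimal sketch of the standard forms I have in mind (assuming Eq. (15) is the action-value/expectation definition, Eq. (10) is the discounted return objective, and d is meant to be a state visitation distribution; \rho_0 below denotes the initial state distribution, and the manuscript's exact notation may differ):

% Expectation over actions, with "a \sim \pi" written as a subscript:
V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q_\pi(s, a) \big]

% Discounted return objective indexed from 0, with s_0 drawn from the initial state distribution:
J(\pi) = \mathbb{E}_{\tau \sim \pi} \Big[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \Big], \qquad s_0 \sim \rho_0

% One common reading of d: the (discounted) state visitation distribution under \pi:
d^{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \, \Pr(s_t = s \mid s_0 \sim \rho_0, \pi)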
Reviewer: 2
Comments to the Author
Thanks for submitting to the TOIS journal. This work provides a comprehensive review of recent advancements in offline Reinforcement Learning (RL) for recommendation systems and outlines potential challenges and future directions. Upon careful review, I noted several strengths in your work:
1. The focus on offline RL for recommendation systems is timely and relevant. Offline RL is well suited to the recommendation scenario and could become a hot research topic.
2. This work identifies key challenges and future directions in the field, which could serve as a catalyst for further research.
I also have some concerns:
1. The manuscript dedicates substantial space to the fundamentals of reinforcement learning. I recommend condensing these sections (2.1-2.3) or integrating these concepts within the context of recommendation systems.
2. The motivation for employing offline RL in recommendation systems could be more prominently emphasized. It would be beneficial to address why RL is applicable to recommendation systems and why offline RL is a superior choice. These pivotal questions should be addressed in Sections 1 and 2.
3. Before delving into the review of recent work on offline RL for recommendation systems (RL4RS), it would be helpful if the authors could discuss the specific challenges in RL4RS and how existing methods either address or fail to address them. This would make the review better organized and motivated.
4. I find the categorization of recent offline RL4RS work somewhat unclear. Firstly, merely listing the methods of offline RL in Section 3.2 seems insufficient; a taxonomy of offline RL would be more informative. Secondly, the distinction between off-policy learning with logged data and offline RL4RS is not clear (one possible way to make this contrast concrete is sketched at the end of my comments).
Some minor ones:
5. There are some typos:
a. "\tau \sim \pi" should be a subscript in Equation (5). Similar issues appear in Equations (6), (11), and (12).
b. A typographical error exists on Page 16, Line 11: 'there is 's' another widely … '.
6. Given the numerous notations used in the manuscript, it would be beneficial to include a notation table for reference.
7. Some important related work is missing, such as:
[a1] TOIS '23: CIRS: Bursting Filter Bubbles by Counterfactual Interactive Recommender System.
[a2] SIGIR '23: Offline Evaluation for Reinforcement Learning-Based Recommendation: A Critical Issue and Some Alternatives.
[a3] RecSys '23: Integrating Offline Reinforcement Learning with Transformers for Sequential Recommendation.
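To make the distinction in point 4 concrete, one possible contrast is sketched below; the logging/behavior policy \beta, its estimate \hat{\beta}, the dataset \mathcal{D}, and the regularization weight \alpha are illustrative symbols rather than notation from the manuscript, and R_t denotes the discounted return from step t:

% Off-policy learning from logged data (interaction continues; a first-order importance-weighted correction toward the logging policy \beta):
\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{(s_t, a_t) \sim \beta} \Big[ \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)} \, R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]

% Offline RL from a fixed dataset \mathcal{D} (no further interaction; the learned policy is explicitly constrained to stay close to the estimated logging policy \hat{\beta}):
\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}} \Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q(s, a) \big] \; - \; \alpha \, D_{\mathrm{KL}} \big( \pi(\cdot \mid s) \,\|\, \hat{\beta}(\cdot \mid s) \big) \Big]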
I look forward to seeing these revisions and believe they will significantly improve the manuscript.

Reviewer: 3
Comments to the Author
This paper provides a survey of existing studies on offline RL methods for recommendation systems. The authors explain in detail the definition and formulation of RL methods for RS, and then discuss a number of papers on fully off-policy RL and offline RL for RS. After that, the authors discuss several challenges in existing offline RL for RS studies and several directions that are potentially fruitful for future work.
## Strengths:
- The paper is mostly well written. It's a pleasure to read.
- The topic of the paper is timely and very important. I think people in the community would appreciate such a survey.
- The introduction and formulation of the RL methods are clear. The terminology used in RL4RS papers is a mess, and it's great that this paper spends a significant part of its pages explaining it.
## Weaknesses:
- A problem with the current survey is that, in most cases, the authors simply discuss related work one by one in Section 3. While the organization of Section 3.1 is OK, Section 3.2 is not acceptable to me: it should be the most important section of this paper (it directly covers offline RL4RS), yet the authors simply list each work one by one. There are multiple ways to better organize the content, such as grouping works by their similarities in methods, tasks, experiment design, evaluation paradigm, etc.
- Related to the limitation above, it is a bit frustrating that the survey only covers basic descriptions and discussions of each paper's method in text, while ignoring their downstream application scenarios and experimental design. To facilitate research in this direction, people would like to know how existing studies organize the experiments and evaluation of different methods. It is acceptable not to provide experimental analysis of these methods directly, but I think the authors should at least have a section discussing how existing studies find datasets, design evaluation, and run experiments with their methods.
- The connections between offline RL4RS and some of the future directions seem loose. In particular, Section 5.2 on LLMs seems to be a general direction for RL4RS, and it is not very clear why it is important for offline RL4RS specifically. Since the topic of this survey is offline RL4RS, it would be better to keep the discussion of future directions focused on it.
## Minors:
- In Eqs. (5), (6), and (12), some subscripts seem to be in the wrong format.
- Personally, I feel it would be better to change \pi_{t+1} in Figure 2(b) to \pi_{t+n}, to show that off-policy RL4RS does not update the policy at each time step. Otherwise it is hard to distinguish it from on-policy RL4RS (a possible form is sketched below).
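For example, assuming the policy in Figure 2(b) is updated only after every n logged interactions (n and the buffer \mathcal{B} are illustrative symbols, not notation from the manuscript), the two regimes could be contrasted as:

% On-policy RL4RS: the policy is updated after every interaction step.
\pi_{t} \rightarrow \pi_{t+1} \rightarrow \pi_{t+2} \rightarrow \cdots

% Off-policy RL4RS: the behavior policy stays fixed while n interactions are logged into a buffer, then the policy is updated in batch.
\pi_{t} = \pi_{t+1} = \cdots = \pi_{t+n-1}, \qquad \pi_{t+n} \leftarrow \mathrm{update}\big( \pi_{t}, \, \mathcal{B}_{t:t+n} \big)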