BEGINNING OF COMMENTS TO THE AUTHOR(S)

+++++++++++++++++++++++++++++++++++++++

Recommended Decision by Associate Editor:
Recommendation #1: Reject & Resubmit
Recommendation #2: Reject & Resubmit

Comments to Author(s) by Associate Editor:

Senior Editor: 1
Comments to the Author:
(There are no comments. Please check to see if comments were included as a file attachment with this e-mail or as an attachment in your Author Center.)

Associate Editor: 2
Comments to the Author:
This paper proposes a reinforcement learning method for recommendation with sparse feedback. The reviewers have raised multiple concerns about the motivation and claims, the technical clarity, and the experiments. The authors are advised to conduct a thorough revision.

+++++++++++++++++++++

Individual Reviews:

Reviewer(s)' Comments to Author(s):

Reviewer: 1
Comments to the Author
Comments: This paper proposes a general Model-Agnostic Counterfactual Synthesis (MACS) Policy for counterfactual user interaction data augmentation and integrates the proposed policy with Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3) in an adaptive pipeline with a recommendation agent, so that counterfactual data can be generated to improve recommendation performance. The following issues should be considered to further improve this manuscript.

1. Why do the authors claim that "One of the main obstacles for the RL-based recommender system is that it will struggle to precisely grasp users' preferences and generate suitable recommendations only with limited interaction data"? To the reviewer's knowledge, data-driven methods can still produce personalized results.

2. Limited interaction data is a problem particular to offline reinforcement learning, yet the authors use online reinforcement learning methods such as DDPG, SAC, and TD3. Can these methods be applied to offline data without suffering from the distribution-shift problem?

3. The reviewer is curious about the reward function $f_R$ of the recommender system in Equation (6).

4. There appears to be an error in the pseudocode; the authors should check it.

5. The authors should describe how $\Delta_1$ and $\Delta_2$ are computed.

6. The authors should compare the proposed method with state-of-the-art (SOTA) methods to better demonstrate the improvement brought by the Model-Agnostic Counterfactual Synthesis Policy. In addition, can the proposed method be applied to SOTA methods rather than only DDPG?

Reviewer: 2
Comments to the Author
The overall writing is clear and fluent, but some descriptions should be improved, and additional results should be provided in the experiment section to further demonstrate the effectiveness of the proposed method. For more details, see the attached files.

Reviewer: 3
Comments to the Author
The paper proposes a model-agnostic counterfactual synthesis policy for data augmentation to address the sparsity issue in recommender systems. Specifically, a two-step learning framework is proposed that trains the counterfactual policy and performs data augmentation in separate steps.

Strengths:
• The paper is well-structured and easy to follow.
• The augmentation method that employs a model-agnostic policy is a novel approach.

Weaknesses:
• The logical flow of the introduction should be improved. For example, the authors should explain in detail what causality and counterfactuals are. I also suggest that the authors clarify the relationship between causality, the concept of counterfactuals, and the proposed data augmentation method.
• The authors state that the KL divergence is used to learn the counterfactual distributions, but the computation of the KL divergence is known to be unstable. The authors should describe the training process and provide an analysis of the KL term in the model, at least a qualitative one (an illustrative note is appended after these comments).
• The authors should provide more technical details about the experiments. For example, how is the augmentation policy trained, and which hyper-parameters are used?
• As the proposed policy aims to solve the sparsity problem, please provide an analysis of dataset sparsity before and after applying the policy.
• The models seem to be sensitive to the batch size in Fig. 5. When the batch size is set to 256, the model variance is significantly large, whereas it is small at a batch size of 64. This observation contradicts my understanding that a small batch size may lead to large uncertainty. The results make the model unconvincing, especially for real-world scenarios that require a large batch size. Please explain why this happens.
• In Table I, the variance of the baseline models on the MovieLens datasets is very large; for example, the variance of the accuracy of DDPG on MovieLens-100k is approximately 15%. However, after applying the synthesis model, the variances are significantly reduced. Please explain why.
• The authors should compare the proposed method with other counterfactual augmentation-based baselines, e.g., "Sample-efficient reinforcement learning via counterfactual-based data augmentation".
• I suggest that the authors follow the writing convention of adding a prefix before a single reference at the beginning of a sentence, e.g., "Ref. [14] proposes to measure the similarity between …".

+++++++++++++++++++++++++++++++++++++++

In addition, the paper needs to seriously address the concern about the completeness of the literature review of the related subjects and the concern about its suitability for the journal: has TNNLS ever published a single paper on the topic described in this manuscript before? If yes, why is there no mention of such work(s) anywhere in the manuscript? If no, why would this type of topic be of interest to the readers of the journal? This further raises concern about the suitability of the paper for the journal.
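Note appended to Reviewer 3's comment on the KL divergence: as one concrete source of the instability concern, consider the closed-form KL divergence between two univariate Gaussians (used here purely as an illustration; the manuscript does not state which parametric form the counterfactual distributions take):

$$
\mathrm{KL}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\middle\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
= \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.
$$

The $1/(2\sigma_2^2)$ term grows without bound as the learned variance $\sigma_2^2$ shrinks, so gradients through the KL term can explode unless the variance is lower-bounded or the term is clipped. An analysis of this kind, for the distributions actually used in the model, is what the authors are asked to provide.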