Primary SPC (meta-reviewer) review (reviewer 5), score 4/5

The Meta-Review
The reviewers generally found the presented work novel and interesting. They also appreciated the in-depth experimental analysis, noting, however, that only one baseline was used in the experiments. The presentation was generally considered good, although the reviewers also listed a number of aspects that should be further improved. Overall, despite certain limitations, the reviewers see this paper as a possible candidate for acceptance.

----------------------------------------------------------------

Secondary SPC (SPC reviewer) review (reviewer 3), score 3/5

Expertise
Expert

Summary of Contributions
The paper proposes a technique for navigation through multi-step critiquing. The main novelty is that, using a simulated dialogue with an agent trained to represent the user's preferences, the critiquing model is trained to handle critiques as effectively as possible.

Strengths
1. The idea is interesting and combines work in natural language dialogue with recent work on critiquing in recommendation.
2. The experimental results show dramatic improvements over critiquing baselines.
3. Critiquing is a useful technique for recommendation that merits more attention.

Weaknesses
1. The baseline used for comparison, CE-VAE, is not the strongest baseline for multi-turn critiquing; M&Ms VAE (ref. [1]) is known to be stronger.
2. It seems that the experimental evaluation uses the same seeker model as the bots used to train the new technique. Thus, there is a connection from training data to the evaluation, which means that the reported results are likely to be too good.
3. The LLC method used for critiquing is known to have very high computational overhead, while the CE-VAE method, and in particular M&Ms VAE, trades expressivity for much higher efficiency; nothing is reported about the computation time required by the different methods.

Detailed Review
The paper is interesting and novel, but this version contains too many claims that are not justified. For example, Table 1 claims that none of the conversational critiquing techniques can deal with multi-turn conversations, but M&Ms VAE has been developed for and evaluated on just that. Also, M&Ms VAE is known to be a much stronger baseline than CE-VAE for multi-turn critiquing; why was this not used instead? The method in Antognini et al., "Interacting with explanations through critiquing", might be more suitable for the bot framework, as it could accommodate new key phrases more easily.

Computational complexity is a big issue here. Critiquing models have to be computed for each user individually and are thus difficult to scale to thousands of users. What is the computation time required to process each critique for the different methods? And what is the time required for the simulated conversations with the bots?

The most worrisome aspect is that the new techniques proposed in the paper are trained with the same seeker model that is used in the experiments. Even if the seeker model is applied to different users - which is not entirely clear in the paper - the recommenders are still trained to perform well under this particular seeker model. This interaction between training and testing scenarios is likely to overestimate the true performance when an actual user critiques according to a different process. This interaction would have to be explored further to make the experimental results credible.
Human Participants Research
Does not report on human-participants research

Ethical considerations
(blank)

Review Rating
Borderline: Overall I would not argue for accepting this paper.

----------------------------------------------------------------

PC review (reviewer 1), score 4/5

Expertise
Knowledgeable

Summary of Contributions
This paper proposes a two-part framework (classic recommendation models + self-supervised bot-play) for training multi-turn conversational recommender systems that can generate recommendation rationales (explanations) for users to critique, and then use the critiques to improve the recommendation. The authors conducted extensive offline experiments on three different real-world datasets to demonstrate the effectiveness of the proposed framework. They also conducted a user study to evaluate their model in cold-start settings. This work will be valuable for researchers and practitioners who study conversational recommender systems.

Strengths
1. Proposed a framework (recommendation techniques + self-supervised bot-play) to train conversational recommender systems using product review data.
2. Demonstrated the usefulness of the proposed framework using two classic recommendation models (BPR and PLRec).
3. Conducted extensive experiments (offline experiments + human evaluation + user studies) to evaluate the framework.

Weaknesses
1. The motivation for choosing self-supervised bot-play could be better articulated.
2. There is some confusion regarding the notation. It would be better to describe the notation used in a table.
3. Lack of discussion of whether an ethical review of the human-participants research was necessary and whether it was conducted.

Detailed Review
This paper addresses an interesting and timely topic, conversational recommender systems, by contributing a two-part framework (classic recommendation models + self-supervised bot-play) for training multi-turn conversational recommender systems. The extensive offline experiments and user studies demonstrate the effectiveness of the proposed framework. This will be of interest and value to researchers and practitioners who study conversational recommender systems.

Positive:
1. The authors proposed an effective framework for training conversational recommender systems with several advantages: (1) it provides justifications of recommendations so that users can effectively give feedback; (2) it supports multi-turn conversation; (3) it relies only on user review data instead of dialogue transcripts.
2. The paper clearly shows the novelty of the proposed framework by comparing it with existing studies on conversational recommender systems.
3. For the method part, the paper clearly describes the proposed framework and each module using figures.
4. Regarding the experimental setting, the experimental procedure is clearly presented, with a detailed summary of the best parameters and the compared methods.
5. The authors conducted extensive experiments to show the effectiveness of the proposed framework in terms of several aspects: (a) multi-step critiquing ability and recommendation performance; (b) generating useful and accurate rationales. The experimental results are presented well with figures and tables, and with clear interpretation.
6. The authors further demonstrated the effectiveness (usefulness, informativeness, knowledgeability, and adaptiveness) of the framework for training conversational recommenders by involving human participants. Moreover, a user study was conducted to evaluate the usefulness of the model in a real-time cold-start setting.
Negative (suggestions for revisions):
1. In the Introduction section, it is suggested to introduce what conversational critiquing and self-supervised bot-play are at their first occurrence, as some researchers may not be familiar with these methods. Also, it would be better to describe the motivation for using them in the framework (e.g., what the advantages are). The motivation for the chosen methods should be better articulated.
2. In line 66, the statement is not clear: "leveraging ideas from conversational critiquing systems [33] trained via next-item recommendation".
3. Regarding the notation, I find some of it a bit confusing without a clear definition. For example, what do "A" and "k" denote in Section 3.2 and Section 3.3? Moreover, although some notations such as "U" and "V" are commonly used in the field of recommender systems, it would be better if the authors described all notations used at their first occurrence so that other researchers can understand the formulas well and replicate the experiments in their studies. I would suggest describing the notations used in a table.
4. For the experimental part, it would be better if the authors adopted a "general to specific" structure, e.g., first give a summary of the addressed research questions in the experimental study and then describe them one by one. This may make the paper easier for readers to follow.
5. As the experiments involve human participants (Section 6), it is necessary to discuss whether an ethical review of the human-participants research was necessary and whether it was conducted.

Based on the contribution of this work to recommender systems, I would argue for accepting this paper.

Human Participants Research
CONCERN: Fails to address ethics/review of human participants research

Ethical considerations
The paper did not discuss whether an ethical review of the human-participants research (Section 6) was necessary and whether it was conducted.

Review Rating
Probably accept: I would argue for accepting this paper.

----------------------------------------------------------------

PC review (reviewer 2), score 4/5

Expertise
Expert

Summary of Contributions
This paper presents a variant of a critiquing-based conversational recommender system. The novelty is twofold: 1) introducing self-supervised bot-play to enhance multi-step critiquing performance, and 2) decomposing the latent representation of the recommender system such that critiquing is applied only to the user embeddings (instead of the model embeddings, as in CE-VAE). The proposed system is applicable to two well-known recommender systems (BPR and PLRec), where the experiments show reasonably better performance than the baseline model (CE-VAE).

Strengths
1. The paper is well written, with good visualization of the proposed framework.
2. It contains a sufficient literature review that covers most of the recent publications in the field.
3. The experiments are interesting to read, although there is only one baseline model, which is a bit disappointing.

Weaknesses
1. I think the paper is over-claiming the proposed model's applicability.
2. Lack of baseline comparison; only CE-VAE is used.
3. Some settings lack explanation.

Detailed Review
Overall, the paper is well written, with good visualization of the proposed framework. It contains a sufficient literature review that covers most of the recent publications in the field. The experiments are interesting to read, although there is only one baseline model, which is a bit disappointing. I think the paper is over-claiming the proposed model's applicability.
For example, I don't see how the framework is suitable for the cold-start setting, as the backbone model is essentially matrix factorization. There are preference elicitation works trained for a similar purpose [CITE]; this paper is not the only one that does it. So, when the authors describe bot-play in Section 3.4, I assumed they were going to compare the system with LLC or another sequential model, but they do not. There should be a comparison.

Some settings lack explanation, and I would suggest that the authors describe them more clearly. E.g., why is the fuse function an element-wise mean? Is there any reason or justification? E.g., when describing each natural language rationale a in Section 3.2, do you mean keywords?

Some minor suggestions:
1. Inconsistent citation style in Table 1, where some citations are given by model name while others are given by author names.
2. In the equation in Section 3.2, k_{i,a} is a scalar element, right? Why use vector notation?

Human Participants Research
Human participants research but would not require review

Ethical considerations
(blank)

Review Rating
Probably accept: I would argue for accepting this paper.

----------------------------------------------------------------

PC review (reviewer 4), score 4/5

Expertise
Expert

Summary of Contributions
This paper presents a framework for training critiquing-based conversational recommender systems, which features a bot-play strategy to fine-tune the model using data collected from user-written text reviews. The framework is made up of several components: a recommender system (based on either BPR or Projected Linear Recommendation), a justification module (which also acts as a dynamic critique generation strategy), and a critiquing module. For the bot-play algorithm, a simulated user (seeker) is developed, which, given a recommended item and a justification, selects the most popular rationale (i.e., aspect) from the justification as the critique (a hypothetical sketch of this rule is given after the weaknesses below). The aforementioned modules are updated through a loss function that is specific to the bot-play task. The paper describes an extensive set of experiments, which compare the proposed approach against a set of baselines based on linear critiquing, as well as an approach based on variational autoencoders. The experiments are conducted on three different datasets (Goodreads Fantasy, BeerAdvocate, Amazon CDs & Vinyl). Experimental results show that the proposed approach outperforms the baselines. Through an ablation study, the authors show that bot-play contributes to the performance of the proposed approach. The paper also discusses two types of user study: the first involves a manual evaluation of simulated recommendation sessions, while the second involves interactions with real users. Results show that the proposed approach is able to generate useful and accurate justifications, and that it can adapt to user feedback.

Strengths
- The proposed approach is quite novel.
- The paper describes a very extensive set of experiments, which strengthens the validity of the results.
- Experimental results show that the proposed approach significantly outperforms the current state of the art.
- The contribution is relevant to the RecSys community.

Weaknesses
- The description of the proposed approach is a bit confusing: as it stands, several passes are required to fully understand how it works, and some details are missing.
- Some of the choices in the experimental protocol require more justification.
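The seeker rule summarized above can be made concrete with a minimal sketch, assuming that the "popularity" of a rationale means its corpus-level frequency; this assumption, as well as every function and variable name below, is hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import List, Optional

def seeker_critique(justification: List[str],
                    rationale_counts: Counter) -> Optional[str]:
    """Rule-based seeker: given the rationales used to justify a recommendation,
    return the most popular one as the critique (or None if none are known)."""
    candidates = [a for a in justification if rationale_counts[a] > 0]
    if not candidates:
        return None  # nothing to critique; accept the recommendation as-is
    return max(candidates, key=lambda a: rationale_counts[a])

# Toy example: among three rationales in the justification, the most frequent
# one ("hoppy") is selected as the critique.
counts = Counter({"hoppy": 120, "malty": 45, "citrus": 30})
assert seeker_critique(["citrus", "hoppy", "malty"], counts) == "hoppy"
```

Whether the actual seeker is purely rule-based in this way, or involves some training, is exactly what the detailed review below asks about.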
Detailed Review
Overall, I think that the paper contains a great deal of work, especially in terms of the experiments performed to validate the proposed approach. It makes use of some pre-existing components (such as the BPR recommender), but the bot-play approach is fairly novel. Moreover, the experiments are quite effective in demonstrating the advantages of the proposed approach.

However, I have some doubts regarding the description of the proposed approach, as there are many small gaps of information in Section 3, which make it hard to appreciate the work fully. This is especially true for the many equations described in the paper, as some terms are not introduced/described properly. Some examples include:
- Line 238: k_{i,a}^{I}, A;
- Line 289: m_{u}^{t};
- Line 309: Ω(W).

Algorithm 1 shows the calculation of the recommendation loss L_{CE}, but not that of the rationale prediction loss L_{A}. Based on the description in Section 3.4, it seems that it should be included.

The authors represent a user critique as a vector of integers. This seems like a strange choice, and I think it should be explained more carefully. Moreover, the critiquing function f_{crit} is introduced, but it is not described how it is calculated: it is probably linked to the equation in line 249, but I am not sure, since it uses different symbols.

Regarding the description of the bot-play algorithm, the sentence in lines 324-326 is not clear. Specifically, it is not clear whether there is some training involved in the seeker model: from the description, it seems to be rule-based (select the most popular rationale). Moreover, the paper mentions the popularity/weakness of a rationale, but does not describe how they are calculated.

It is not clear what recommendation algorithm is used in the linear critiquing baselines (UAC, BAC, LLC-score). To avoid any bias, it should be the same as the one used in the proposed approach. Please add more details regarding the baselines used in the experiment.

I think that the non-bot-play baselines described in Figure 5 could be included as baselines in all experiments. This can help distinguish the contribution of each component to the overall performance. Related to this, Figure 5a could be merged with Figure 4a.

In the results in Figure 7, it is not clear to me why the reported F1 scores are higher than 1. I would expect the scores to be in the [0,1] range. Please fix this, or explain the reason behind this range (a short bound showing why F1 should not exceed 1 is appended after this review).

The authors report statistical significance for the human evaluations; however, the specific statistical test (e.g., t-test, chi-square) is not mentioned. Please add this information to the paper.

Despite these considerations, I believe that this paper may be a good contribution to RecSys.

Human Participants Research
Human participants research but would not require review

Ethical considerations
none

Review Rating
Probably accept: I would argue for accepting this paper.

----------------------------------------------------------------
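On the F1 point raised above: with the standard definition of F1 and precision P and recall R both in [0,1], the score is bounded by 1, which is why values above 1 in Figure 7 call for an explanation. This bound concerns the standard metric only; the paper may define its score differently.

```latex
F_1 = \frac{2PR}{P + R} \le 1,
\qquad\text{since}\qquad
2PR = PR + PR \le P \cdot 1 + R \cdot 1 = P + R
\quad\text{for } P, R \in [0,1].
```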