Meta Review of Paper674 by Area Chair CZGq
ACL ARR 2023 December Paper674 Area Chair CZGq, 03 Feb 2024
Readers: Paper674 Senior Area Chairs, Paper674 Area Chairs, Paper674 Authors, Paper674 Reviewers Submitted, Program Chairs

Metareview: This paper introduces an evaluation protocol for assessing the ability of Large Language Models (LLMs) to emulate human behaviors in conversational recommendation tasks. The protocol comprises five subtasks; through experiments, the authors reveal insightful gaps between LLM simulators and human behaviors. The work is the first to propose such an evaluation protocol, offering interesting insights for the development of advanced evaluation methods in the domain of conversational recommendation.

Summary Of Reasons To Publish: There was strong engagement between the authors and reviewers in discussing this paper. All reviewers agree that the proposed study is novel and interesting and that the approach is reasonable. The findings are interesting as well, although they need more support from larger and more diverse datasets.

Summary Of Suggested Revisions: The authors have provided a good summary of the weaknesses (which is considerably longer, in word count, than their summary of the strengths), and their justifications are reasonable. At the same time, some specific points raised by the reviewers are valid and should be addressed carefully, for example, the treatment of high- and low-frequency movies in the study.

Overall Assessment: 4 = There are minor points that may be revised
Suggested Venues: NAACL, ACL
Best Paper Ae: No
Ethical Concerns: Nil
Needs Ethics Review: No
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
Great Reviews: NcYT, ubeZ

Response to all reviewers
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 25 Jan 2024

Comment: We appreciate your time, effort, and valuable feedback. Here we summarize our paper, the main points addressed by the reviewers, and our response. We provide a more detailed response to each reviewer in the comments.

Paper summary: We propose the first protocol that evaluates LLMs as user simulators in conversational recommender systems (CRSs). The protocol consists of five tasks, each measuring a key ability that a user simulator should exhibit. All tasks run automatically, comparing the behavior of a large population of simulators to real user datasets. By running these tasks, we show discrepancies between simulator behavior and real user behavior, and we present potential fixes to mitigate the discrepancies.

Strengths (commonly mentioned by the reviewers): Our problem is important and novel, and the proposed method is reasonable. We present multiple findings that uncover issues with current LLMs as user simulators; these findings are useful for the further development of simulators.

Weaknesses (commonly mentioned by the reviewers): The scale and scope of the datasets are limited. In particular, our datasets are focused on movie recommendations; more datasets could have been used for our tasks.

We acknowledge that our paper indeed uses only movie datasets. This is mainly due to the availability of datasets in the field of conversational recommendation. Currently, public datasets of real (non-synthetic) conversational recommendations are predominantly about movies [1]. Datasets are either crowdsourced or web-scraped, and both methods are tricky and expensive. Crowdsourcing requires synchronizing two workers at the same time per dialogue; both workers should have basic knowledge of the domain and should follow a demanding set of instructions [2]. Movie recommendation is comparatively easy, since it requires less expertise than other domains. Scraping requires extensive data cleansing, such as parsing item names and discerning whether items are recommended or not [3]. As such, many notable works on conversational recommender systems that use real datasets are on movie recommendations [2, 3, 4, 5]. There are also reviewer-specific questions and concerns. We provide detailed responses to each reviewer in the comments.

References
[1] Hongming Gao, Wenqiang Lei, Xiangnan He, Maarten de Rijke, and Tat-Seng Chua. 2021. Advances and challenges in conversational recommender systems: A survey. AI Open.
[2] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In NeurIPS.
[3] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. In CIKM.
[4] Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2022. Towards Unified Conversational Recommender Systems via Knowledge-Enhanced Prompt Learning. In KDD.
[5] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In EMNLP.
Official Review of Paper674 by Reviewer NcYT
ACL ARR 2023 December Paper674 Reviewer NcYT, 23 Jan 2024 (modified: 29 Jan 2024)

Paper Summary: This paper introduces an evaluation protocol for measuring LLMs' abilities to emulate human behaviors in the conversational recommendation task. The protocol consists of five subtasks: ItemsTalk (choosing items to talk about), BinPref (expressing binary preferences), OpenPref (expressing open-ended preferences), RecRequest (requesting recommendations), and Feedback (giving feedback). Experiment results on the five aspects provide interesting insights into the gaps between LLM simulators and human behaviors, and therefore point out potential future research directions for aligning these simulators better with real users.

Summary Of Strengths: The idea of evaluating how well LLMs can simulate real human users in conversational recommender systems is novel and interesting. The categorization of the evaluation is reasonable and represents different aspects of alignment with human behaviors. Experiment results are meaningful, revealing a number of issues with current LLMs as user simulators:
- Simulators are generally less diverse
- Simulators do not represent user preferences well
- Simulators express preferences differently
- Simulators are not coherent enough to provide feedback
- Simulators sometimes fail to capture nuances in recommendation requests
The paper provides in-depth analysis of the experiment results and meaningful visualizations, as well as details (such as the prompts used) for reproducibility.

Summary Of Weaknesses: One of my major questions is why these 5 subtasks were chosen. It would be reasonable to look at previous work on analyzing user behaviors and see whether its categorization of behavior analysis aligns with the 5 categories. Personally, I feel that several of these 5 aspects could be combined to make the categorization more reasonable. For instance, 'ItemsTalk' could be combined with 'RecRequest', since RecRequest basically adds more descriptive details on top of the item chosen for recommendation. Similarly, BinPref could be combined with OpenPref into a general Preference category.

The experiments and analysis in this paper seem to be limited to the movie recommendation task. I am not sure whether such observations generalize to other conversational recommendation tasks. Therefore, the scope of the analysis is too narrow for the conclusions drawn about the relationship between LLM simulators and human users in general.

Comments, Suggestions And Typos: As mentioned, consider reframing the paper to be specifically about the movie recommendation task, since all the datasets utilized in the experiments concern movie recommendations and reviews.

Soundness: 4 = Strong: This study provides sufficient support for all of its claims/arguments. Some extra experiments could be nice, but not essential.
Overall Assessment: 3.5
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Ethical Concerns: No ethical concerns.
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
Response to reviewer NcYT
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 25 Jan 2024

Comment: Thank you for your valuable feedback. We appreciate your time and effort. We address your questions as follows.

One of my major questions is why these 5 subtasks were chosen. It would be reasonable to look at previous work on analyzing user behaviors and see whether its categorization aligns with the 5 categories. Personally, I feel that several of these 5 aspects could be combined to make the categorization more reasonable. For instance, 'ItemsTalk' could be combined with 'RecRequest', since RecRequest basically adds more descriptive details on top of the item chosen for recommendation. Similarly, BinPref could be combined with OpenPref into a general Preference category.

Thank you for the suggestion! Since our tasks are split into five, we can flexibly group them into broader themes. The reason we chose the tasks at a finer granularity is that each task measures a specific ability, independent of the other tasks. For example, we could combine T2 (BinPref) and T3 (OpenPref) into a single task called "Pref" and measure how simulators' preferences differ from humans'. But we may want to examine more closely: are simulators failing to represent the human preference distribution (BinPref), or are they simply failing to express preferences with the same mannerisms as humans (OpenPref)? This is an important question, since the design of conversational recommender systems often depends on assumptions about how the simulator will express preferences [4, 5, 6]. We could likewise combine T1 (ItemsTalk) and T4 (RecRequest) into a single task called "Diversity". However, we might want to ask the following question: if simulators are generating less diverse requests, is it because they are incapable of mentioning diverse items (ItemsTalk), or because they have limited ability to generate a specific, personalized request from a given set of items (RecRequest)? This question is important since, for humans, even when they bring up the same item in a conversation, they talk about it differently. These questions can be answered only when the tasks are split to measure specific abilities. We will make these design choices clearer in the revised paper.

The experiments and analysis in this paper seem to be limited to the movie recommendation task. I am not sure whether such observations generalize to other conversational recommendation tasks. Therefore, the scope of the analysis is too narrow for the conclusions drawn about the relationship between LLM simulators and human users in general. As mentioned, consider reframing the paper to be specifically about the movie recommendation task, since all the datasets utilized in the experiments concern movie recommendations and reviews.

You have raised a valid point. Currently, the publicly available real-user datasets in the conversational recommendation community are few and concentrated on movies [1, 7]. Datasets are few because collecting them is tricky and expensive: it requires crowdsourcing two workers at the same time per dialogue, both workers should have basic knowledge of the domain, and they should follow a demanding set of instructions [2]. Datasets are focused on movies because movie recommendation requires relatively little expertise compared to other domains and movie names are standardized. Scraping datasets from the web is also very challenging, as noted by the authors of the Reddit dataset [7]. In particular, parsing item names is difficult, and movie names are relatively easy to parse compared to other items. As such, many notable papers on conversational recommendation that use real-dataset-based evaluation focus solely on movie recommendation [1, 2, 3, 7]. While T2 and T3 can incorporate non-conversational datasets (ratings and reviews) from other domains, T4 and T5, which are critical to simulator evaluation, would still be limited to the movie datasets. Our protocol can easily incorporate other datasets once they become available.

Response to reviewer NcYT (References)
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 25 Jan 2024

Comment: References
[1] Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2022. Towards Unified Conversational Recommender Systems via Knowledge-Enhanced Prompt Learning. In KDD.
[2] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In EMNLP.
[3] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In NeurIPS.
[4] Wenqiang Lei, Gangyi Zhang, Xiangnan He, Yisong Miao, Xiang Wang, Liang Chen, and Tat-Seng Chua. 2020b. Interactive path reasoning on graph for conversational recommendation. In KDD.
[5] Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. 2020a. Estimation-action-reflection: Towards deep interaction between conversational and recommender systems. In WSDM.
[6] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conversational recommender systems. In KDD.
[7] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. In CIKM.
Response to the Authors
ACL ARR 2023 December Paper674 Reviewer NcYT, 29 Jan 2024

Comment: Thanks for the detailed response and clarification. I will adjust my scores accordingly. Good luck!

Thank you
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 30 Jan 2024

Comment: We are glad that you find our work interesting. Thank you again for the valuable feedback.

Official Review of Paper674 by Reviewer KWgR
ACL ARR 2023 December Paper674 Reviewer KWgR, 22 Jan 2024

Paper Summary: This work proposes the first evaluation protocol for LLM-based user simulation in conversational recommendation. The exploration and findings are insightful for developing advanced evaluation methods.

Summary Of Strengths: The evaluation protocol provides valuable insights for CRS development. The formulation of the five tasks is essential for evaluating a conversational recommender system. The experiments are robust and extensive, with convincing findings that are useful for the further development of evaluation methods.

Summary Of Weaknesses: How well does the LLM-based simulator align with human preferences? Does the extent of alignment affect the evaluation? Figure 3 shows the relationship between the item rating and how the simulator accepts the recommended items, but it seems that the authors listed only 200 items in the figure. Could the authors conduct a statistical analysis and report the average distance between the actual rating and the acceptance score, or a score such as Recall@1? Apart from diversity and coherence, I believe that "How well do simulators reflect human preferences?" is the more important question.

Comments, Suggestions And Typos: As mentioned in Appendix A, 400 movies are used from the MovieLens dataset. Why consider only extremely high-frequency and low-frequency movies? Why not randomly sample around 1000 movies from MovieLens? This way, movies across all ranges of frequency could be explored.

Soundness: 3.5
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Response to reviewer KWgR
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 25 Jan 2024 (modified: 25 Jan 2024)

Comment: Thank you for your valuable feedback and for your time and effort in reviewing our paper. We address your questions as follows:

How well does the LLM-based simulator align with human preferences? Does the extent of alignment affect the evaluation?
Figure 3 shows the relationship between the item rating and how the simulator accepts the recommended items. It seems that the authors only listed 200 items in the figure. Could the authors conduct a statistical analysis and report the average distance between the actual rating and the acceptance score, or a score such as Recall@1? Apart from diversity and coherence, I believe that "How well do simulators reflect human preferences?" is the more important question.

Indeed, evaluating simulator alignment with human preferences is important; the goal of our second task (BinPref) is exactly to evaluate how well simulator preferences align with human preferences. You have made a valid point that 200 items may seem small for the BinPref task. To clarify, 100 simulators responded to each of the 200 items, so Figure 3 shows the result of 20,000 responses. We report the correlation coefficients in Table 3 and obtained statistically significant results: all p-values were below 0.0001, except in the gpt-3.5/demographic case, which had a p-value of 0.012. The result is that simulators poorly reflect human preferences (Finding 3), while adding personality traits to the simulators improves the alignment (Finding 4).

As mentioned in Appendix A, 400 movies are used from the MovieLens dataset. Why consider only extremely high-frequency and low-frequency movies? Why not randomly sample around 1000 movies from MovieLens? This way, movies across all ranges of frequency could be explored.

That is a great point! The reason we sampled high- and low-frequency movies separately was to address the following question: do LLMs show better preference alignment for high-frequency movies than for low-frequency movies? While we expected LLMs to perform better on high-frequency movies (as they are more likely to appear frequently in the LLM training corpus), the result is that LLMs perform similarly on the two sets (Finding 3). We find this result interesting, since LLMs tend to perform better on data points that appeared more often in training [1]. We will revise the paper to clarify our reason for sampling high- and low-frequency movies separately, and we will experiment with movies randomly sampled across all ranges of frequency and add the results to the paper.

References
[1] Dziri et al. 2023. Faith and Fate: Limits of Transformers on Compositionality. In NeurIPS.

Official Review of Paper674 by Reviewer GTQM
ACL ARR 2023 December Paper674 Reviewer GTQM, 17 Jan 2024

Paper Summary: The authors present a novel protocol aimed at assessing how well language models can replicate human behavior in conversational recommendation scenarios. The protocol comprises five tasks, each focusing on a specific ability a synthetic user should possess: selecting topics for discussion, expressing binary preferences, articulating open-ended preferences, soliciting recommendations, and providing feedback. By assessing baseline simulators, the authors show that these tasks effectively highlight disparities between language models and human behavior. Additionally, the evaluation offers valuable insights into mitigating these disparities through model selection and prompting strategies.

Summary Of Strengths: The problem studied in this paper is interesting and novel. The proposed solution is reasonable. The results look promising.

Summary Of Weaknesses: The datasets are small in scale, and some of them are quite toy. The baseline methods are simple and designed by heuristics. I wonder whether there are better ways to select the baselines.

Comments, Suggestions And Typos: More real-world, large-scale datasets are strongly recommended.

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments.
Some minor points may need extra support or details.
Overall Assessment: 2.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns: None
Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Response to reviewer GTQM
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 25 Jan 2024

Comment: Thank you for your time and effort in reviewing our paper and providing us with valuable feedback. We appreciate your recognition of the novelty of our problem and the solution we propose. We address your concerns as follows.

The datasets are small in scale, and some of them are quite toy. More real-world, large-scale datasets are strongly recommended.

We agree on the importance of using real-world datasets. We would like to clarify that in all our experiments, we have indeed used real-world, large-scale datasets (Section 3: Datasets; A.1: Dataset statistics):

Dataset   | Data Size         | Content
ReDial    | 11k conversations | Movie recommendations between two humans
Reddit    | 23k conversations | Movie recommendations from Reddit
MovieLens | 162k users        | Movies rated by real users
IMDB      | 1k users          | Reviews written by real users, each with 11+ reviews

Compared to similar work in ICML that compares LLM-generated data to human data [1], our datasets are ten to a hundred times larger in scale. The fact that we used real-world, large-scale datasets may not have been very clear in our paper; we will revise the paper so that this important point becomes clear.

The baseline methods are simple and designed by heuristics. I wonder whether there are better ways to select the baselines.

We agree that our baselines are simple. They are prompt-based approaches, and there could be more sophisticated techniques, such as simulators retrieving information from external memory [2]. Although the baselines are simple, we observe in our study that simple prompt-based strategies can improve the alignment between humans and simulators (Section 4: Findings 2, 4, 5).

References
[1] Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. 2023. Using large language models to simulate multiple humans and replicate human subject studies. In ICML.
[2] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In UIST.

Official Review of Paper674 by Reviewer ubeZ
ACL ARR 2023 December Paper674 Reviewer ubeZ, 16 Jan 2024

Paper Summary: This paper proposes a protocol for measuring how well LLM-based simulators can mimic human behavior in the application domain of conversational recommendation.
The protocol is centered on five tasks---ItemsTalk (T1), BinPref (T2), OpenPref (T3), RecRequest (T4), and Feedback (T5)---which are evaluated over populations of users. In this way, the paper proposes five task-specific evaluation methods, instantiates them using subsets of four well-known datasets---ReDial, Reddit, MovieLens, and IMDB---in combination with prompting-based methods, and then discusses findings for each task. The authors' goals are two-fold: designing a protocol capable of measuring the discrepancies between LLM-based simulators and real humans in conversational recommendation across different user intents, and releasing this protocol as a benchmark for other works that increasingly rely on LLM-based simulation.

Summary Of Strengths: The paper is well motivated and pursues important goals. The five tasks are interesting, measuring aspects such as item diversity (T1), the alignment of structured preferences (T2), and more challenging open-ended interactions (T3-5). The findings obtained for T1 align with contemporary discussions outside of the conversational recommendation space and may contribute to them. Works such as "When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories" [https://aclanthology.org/2023.acl-long.546/] have analyzed the behavior of LLMs' parametric vs. non-parametric memories w.r.t. popular concepts; Findings 1 & 2 seem to connect to these behaviors, but with specific implications for the conversational recommendation space. The paper is generally well written.

Summary Of Weaknesses: The soundness of the methods varies across tasks. On one hand, T1 is evaluated over three datasets, measuring a property (item diversity) that is more amenable to a fully automated method; on the other hand, T2-5 are evaluated on only a single dataset, and in ways that are less amenable to full automation. For example, T3 relies fully on the outputs of PyABSA, which may be error-prone; to make the evaluation of T3 sound, the authors could have selected a sample and performed some degree of human validation. Consequently, the reliability of the findings also varies a lot: while Findings 1 & 2 hold for multiple datasets and can be measured reliably in a fully automated manner, the remaining findings are supported by only a single data point, with no human(s) in the loop to validate the methodological choices. While the first goal seems to be at least partially achieved by the paper, the current limitations could hurt the paper's ability to become an effective benchmark.

Comments, Suggestions And Typos:
Questions: Have you measured the error/noise in the following assumptions:
a) The use of PyABSA in T3 (see above for context).
b) The random sampling of "negative recommendations" in T5. An alternative explanation for lines 427-430 could be that, to some extent, randomly sampled recommendations are actually appropriate, especially in cases of generic requests (e.g., "Inspirational movies" or "Impactful endings" as seen in Figure 6).
What do the authors mean by "well-rounded" in line 395?
Presentation: Lines 203-206 seem a bit arbitrary; some justification would be beneficial. Lines 472-496 could benefit from more proof-reading.

Soundness: 2.5
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Ethical Concerns: No concerns for this particular work.
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 2 = Documentary: The new software will be useful to study or replicate the reported research, although for other purposes it may have limited interest or limited usability. (Still a positive rating)
Author Identity Guess: 1 = I do not have even an educated guess about author identity.

Response to reviewer ubeZ (Part 2)
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon), 25 Jan 2024

Comment: Have you measured the error/noise in the use of PyABSA in T3? (T3 relies fully on the outputs of PyABSA, which may be error-prone; to make the evaluation of T3 sound, the authors could have selected a sample and performed some degree of human validation.)

That is a valid point. Any software, including PyABSA, may have errors, and human validation would uncover the noise in the results. Since we are comparing the disparities between humans and LLMs, our approach was to apply the same software to both human and LLM responses. Therefore, any software-specific limitations (e.g., false positives and false negatives) apply to the human and LLM responses alike. In this way, we can still conclude that the human and LLM responses differ, since they were run through the same software yet yielded different results.

Have you measured the error/noise in the random sampling of "negative recommendations" in T5? (An alternative explanation for lines 427-430 could be that, to some extent, randomly sampled recommendations are actually appropriate, especially in cases of generic requests, e.g., "Inspirational movies" or "Impactful endings" as seen in Figure 6.)

Also a valid point. We are aware of the possible issue of noise in negative sampling.
Let us first consider the potential error rate, and then discuss the strategies we used to mitigate the potential noise. The Reddit dataset contains 23k requests. We ask: what is the probability that a selected request (e.g., "Inspirational movies") is assigned a negative sample from a request that may have interchangeable movies (e.g., "Impactful endings")? (As in Figure 5, most requests are more specific than this, since these requests are the ones in the lowest entropy segment.) If, for a given request, there are 100 other requests whose recommendations are relevant to this request, the chance of picking one of those recommendations as the negative sample is 100/23k ≈ 0.4%. That said, we do not know the exact error rate. For this reason, we used the following strategies to mitigate potential noise. First, the original Reddit dataset contains multiple comments (recommendations) per request. When we perform negative sampling, we make sure the negative sample is not included in the entire set of these recommended items. Second, we designed two subtasks (accept/reject and comparison). In particular, the comparison task asks which of the two is the better recommendation. Even if the negative movie is somewhat relevant, it is not likely to be more relevant than the positive movie. This further reduces the possible noise in the results. As you said, a thorough analysis of the error rate would indeed be helpful, and a better negative sampling scheme (e.g., sampling from requests that have a large distance from the original request) would further improve the reliability of the observations.
What do the authors mean by "well-rounded" in line 395?
Thanks for pointing out the unclear term. Here is a revised version: LLMs make generic requests. In the example of the request about movies like Joker (2019), the human user was interested in the "loneliness" of the character, which is one detail of the movie; others may have paid attention to a different detail.
In contrast, the LLM simulator finds the movie interesting because it is a "psychological thriller," which is a generic description. We will edit this part in the revision.
Lines 203-206 seem a bit arbitrary. Some justification would be beneficial.
Thanks; we will add the following justification to lines 203-206 (Tasks should be zero-shot; simulators should not be trained or conditioned on our tasks, nor be informed about our evaluation metrics.): This is to prevent simulators from fitting to the tasks instead of performing well in generic situations.
Lines 472-496 could benefit from more proof-reading.
Yes, we will rewrite this part properly instead of simply listing the previous work.
References
Hongming Gao, Wenqiang Lei, Xiangnan He, Maarten de Rijke, and Tat-Seng Chua. 2021. Advances and challenges in conversational recommender systems: A survey. AI Open.
Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In NeurIPS.
Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. In CIKM.

Thank you
ACL ARR 2023 December Paper674 Reviewer ubeZ, 29 Jan 2024 (modified: 29 Jan 2024), Official Comment

Comment: I would like to thank the authors for their thoughtful response. Despite the remaining concerns, I think that the authors are headed in a promising direction, and I would encourage them to validate some of their stronger assumptions by including humans in the loop, even if only covering small samples.
Especially in this work, finding correlations with human evaluation and manageable error margins would facilitate the adoption of their ideas as a framework for others to follow. Secondarily but still importantly, the authors may be able to leverage datasets from a second domain besides Movies: Penha et al. (2020) [1][2] extracted extensive data from the Web about Books and Music, including requests from Reddit (refer to subsection 4.1 in their paper); Chaganty et al. (2023) [3][4] designed a smaller dataset about Music in a "Wizard-of-Oz" style, including human-generated requests. Especially in the full paper format, and for a subject that aims at generalizability (i.e., the applicability of LLMs in CRS is not restricted to Movies), it would be important to investigate a second domain.
[1] https://dl.acm.org/doi/abs/10.1145/3383313.3412249
[2] https://github.com/Guzpenha/ConvRecProbingBERT/tree/master
[3] https://arxiv.org/pdf/2303.06791.pdf
[4] https://github.com/google-research-datasets/cpcd

Thank you for your feedback
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon, privately revealed to you), 30 Jan 2024, Official Comment

Comment: We appreciate your valuable feedback. Human validation would help strengthen our observations. Also, thank you for the pointers to relevant work. The music Reddit dataset would need further processing for our tasks (e.g., separating requests from non-requests, extracting music entities), but the Wizard-of-Oz music dataset could be easier to process. Indeed, a second domain would enhance the generalizability of the results.
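As a concrete illustration of the negative-sampling scheme the authors describe in their Part 2 response, here is a minimal sketch with toy data. The request strings and movie titles are invented for illustration; only the exclusion rule (never sample a movie already recommended for the same request) and the 100/23k back-of-envelope estimate come from the discussion above.

```python
import random

# Toy stand-in for the Reddit dataset: each request maps to the full set
# of movies recommended in its comment thread.
requests = {
    "movies like Joker (2019)": {"Taxi Driver", "Nightcrawler"},
    "inspirational sports movies": {"Rudy", "Miracle"},
    "slow-burn horror": {"The Witch", "Hereditary"},
}

def sample_negative(request, requests, rng=random):
    """Pick a movie recommended for a *different* request, excluding every
    movie recommended for this request (the exclusion rule from the response)."""
    positives = requests[request]
    pool = [m for req, movies in requests.items() if req != request
            for m in movies if m not in positives]
    return rng.choice(pool)

neg = sample_negative("movies like Joker (2019)", requests)
assert neg not in requests["movies like Joker (2019)"]

# Back-of-envelope noise estimate from the response: with ~23k requests, if
# ~100 other requests happen to have interchangeable recommendations, the
# chance of drawing one of them as the negative sample is about 0.4%.
collision_rate = 100 / 23_000
print(f"{collision_rate:.1%}")  # → 0.4%
```

The comparison subtask mentioned in the response adds a second safeguard on top of this: even if the sampled negative is somewhat relevant, it still has to beat the true positive head-to-head.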
Response to reviewer ubeZ (Part 1)
ACL ARR 2023 December Paper674 Authors (Se-eun Yoon, privately revealed to you), 25 Jan 2024, Official Comment

Comment: Thank you for your thorough and valuable feedback. We appreciate your acknowledgement of this problem's importance and the suggestions you have provided. We address your concerns and questions as follows.
T2-5 are only evaluated on a single dataset
Adding more datasets would indeed be helpful to surface more observations. For example, e-commerce recommendations may show different disparity patterns across tasks. We used a single dataset per task because the gain from additional datasets would be small compared to the extensive process required to collect them. We elaborate as follows. The currently available datasets in conversational recommendation, while few, are heavily focused on movies [1]; we explain the reason in our common response. T4 and T5 require real recommendation-request datasets with parsable item names. To the best of our knowledge, only the Reddit dataset satisfies this property, and its authors went through an expensive process to provide the data [3]. While it is possible to parse other Reddit communities (e.g., books, music) in a similar manner, doing so would require the same costly process. One might consider the ReDial dataset instead. However, ReDial is not centered around requesting movies: many conversations are chit-chat (i.e., simply talking about movies), and there are no labels that specify which utterance is a request. We considered designing heuristics to extract requests (e.g., parsing sentences that start with "recommend me"), but the utterances are typically very short (word count: avg 6.77, std 5), which would make the extracted requests overly generic.
This would cause large noise in the negative sampling process (which you pointed out in a later comment). For T2 and T3, the conditions are less restrictive: it is indeed possible to find ratings and reviews datasets from another domain (e.g., books). While we initially chose to unify the domain to movies across all tasks, we could easily add other domains to T2 and T3 and examine the results. We expect advancements in the field of conversational recommendation to lead to the availability of more datasets across various domains, and integrating these datasets into our protocol should be straightforward.
and in ways that are less amenable to full automation.
Full automation is indeed the fundamental goal of our work: we propose an automatic evaluation for user simulators, which in turn automatically evaluate conversational recommender systems. All our tasks can be fully automated, but indeed, T2-T5 rely on external software (e.g., PyABSA, SBERT) or techniques (e.g., negative sampling) that may have their own errors and may require human validation. We discuss the reliability of this software below, but before that, we can view the issue in a more general sense: user simulation is an automated proxy for human users. In the end, it would always be more accurate to validate with humans.

Supplementary Materials by Program Chairs
ACL ARR 2023 December Program Chairs, 16 Dec 2023, Supplementary Materials

Responsible NLP Research: pdf
Note From EiCs: These are the confidential supplementary materials of the submission. If you see no entries in this comment, none have been submitted.
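The "same software on both populations" argument that recurs in the authors' responses (any extractor bias hits human and LLM responses alike, so a systematic gap in its outputs reflects a gap in the inputs) can be illustrated with a minimal sketch. The function `aspect_sentiment` below is a hypothetical stand-in, not PyABSA's actual API, and the utterances are invented.

```python
from collections import Counter

# Hypothetical stand-in for an aspect-sentiment extractor such as PyABSA.
# It is deliberately crude (and thus "noisy"), but it is applied identically
# to both populations, which is the point of the argument.
def aspect_sentiment(utterance):
    polarity = "negative" if "not" in utterance or "boring" in utterance else "positive"
    return [("movie", polarity)]

def polarity_profile(utterances):
    """Run every utterance through the *same* extractor and tally polarities."""
    counts = Counter(pol for u in utterances for _, pol in aspect_sentiment(u))
    total = sum(counts.values())
    return {pol: n / total for pol, n in counts.items()}

human_utts = ["I loved it", "boring plot", "not my thing"]
llm_utts = ["I loved it", "great pacing", "wonderful cast"]

# Since both populations pass through one pipeline, the difference between the
# two profiles can be attributed to the responses rather than to the tool.
print(polarity_profile(human_utts))
print(polarity_profile(llm_utts))
```

As the reviewer notes, this controls for shared bias but not for extractor errors that interact differently with human versus LLM text, which is why a small human-validated sample would still strengthen the claim.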