Official Review of Paper252 by Reviewer UJdU
KDD 2023 Conference Research Track Paper252 Reviewer UJdU, 25 Mar 2023

Summary: The authors propose UCEpic (Unifying aspect planning and lexical Constraints for generating Explanations in Recommendation), which generates high-quality personalized explanations for recommendation results by unifying aspect planning and lexical constraints in an insertion-based generation manner. The authors conduct comparative experiments with several baselines on multiple datasets.

Paper Strength: The paper is easy to follow and well-structured. The authors compare their proposed approach with previous works in a table. The authors conduct comparative experiments with six baselines on two datasets. The authors perform an ablation study and a case study. In particular, the case study makes this work more concrete.

Paper Weakness: The authors need to cite the following important works on explainable recommendation:

[Template-based approaches] This is close to the constrained approach, as template-based approaches first define some predefined explanation templates (a kind of constraint) and then personalize them by inputting different words and features for each user.
Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma: "Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis" (SIGIR 2014)
J. Gao, X. Wang, Y. Wang, and X. Xie: "Explainable Recommendation through Attentive Multi-View Learning" (AAAI-19)

[A work that overcomes the drawback of PETER [20]] At lines 612 to 613, the authors state that "This baseline can be considered as a state-of-the-art model for explainable recommendation." But the following work overcomes the drawback of PETER [20]:
D. V. Hada and S. K. Shevade: "ReXPlug: Explainable Recommendation using Plug-and-Play Language Model" (SIGIR '21)

In addition, the paper on AdamW is not [12] but the following:
Ilya Loshchilov and Frank Hutter: "Decoupled Weight Decay Regularization." In Proc. of the 7th International Conference on Learning Representations (ICLR 2019)

The authors need to carefully format the "References" section. Page numbers are missing in almost all references. The authors also need to check whether the eight arXiv papers, which account for about 20% of all references, have been published as official conference or journal papers. The reviewer found that [4] and [12] have been published at NAACL '19 and ICLR '15, respectively.

Questions To Authors And Suggestions For Rebuttal: In this line of work, the Amazon product dataset is widely used. Do you have any specific reason for selecting the "RateBeer" and "Yelp" datasets rather than the Amazon dataset?

Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 2: Fair - The study may be of relevance within a specific research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature.
Response to Reviewer UJdU
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 13 Apr 2023

Comment: Thanks for the time and effort you have dedicated to reviewing our manuscript!

For missing related works: We appreciate your suggestions regarding related works, including template-based approaches and the recently proposed explainable recommendation model ReXPlug. Template-based models generally (1) select some important features (e.g., item attributes) and (2) incorporate them into pre-defined sentence structures such as "you will like X because Y and Z". However, template-based models struggle to generate vivid, personalized natural-language explanations, and would thus likely perform poorly in our Table 5. We believe that our proposed model can complement the attribute selection used in template-based models. Specifically, we can use the selected attributes as input to the initial stage of UCEpic, ensuring that they are included in the final personalized explanations. This approach allows us to leverage the strengths of both model families and improve the overall quality of explanations in the future. We will add more details in our revision.

Thank you for your comment on the ReXPlug paper. We agree that using cross-attention networks to select important historical reviews from both the user and item perspectives has the potential to enhance the informativeness of generated explanations. However, since the ReXPlug paper did not include PETER as a baseline, it is unclear whether it can effectively address the issues identified in PETER. To address this, we plan to add ReXPlug as a baseline in our revision. We want to emphasize, though, that both ReXPlug and PETER are auto-regressive models that, unlike our UCEpic, cannot enforce lexical constraints on key phrases.

For the Amazon Reviews dataset: Thank you for your suggestion to include the Amazon Reviews datasets in our experiments. However, we did not include them in the fine-tuning phase because we had previously compared pre-training on a general corpus (Wikipedia) against a recommendation-related corpus (Amazon Reviews). That comparison was not included in this submission because we found that generation models trained on Amazon Reviews tended to over-fit and yielded worse results than those trained on Wikipedia. Since that experiment is not the focus of this paper, our finalized experiments only report pre-training on Wikipedia and fine-tuned results on RateBeer and Yelp, which are commonly used and representative datasets for this task. Still, we appreciate your suggestion and, if time permits, we will consider adding Amazon fine-tuning datasets to our experiments on top of our models pre-trained on Wikipedia.

For reference format: Thanks for your suggestions! We will make the necessary updates to the References section in the revision: (1) we will update four papers from arXiv sources to their official conference sources; (2) we will add page numbers for all cited papers; (3) we will include more references that you and other reviewers suggested.
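As a small illustration of the contrast discussed in the first point above, here is a toy sketch (purely illustrative, not our implementation; the feature names, template, and sentences are hypothetical) of a template-based explanation versus lexically constrained generation, where the selected attributes must appear verbatim in the output:

```python
# Illustrative sketch only: template filling vs. a lexical-constraint check.
# Feature names, template, and sentences are hypothetical examples.

selected_features = ["taste", "light grapefruit finish"]  # e.g., from a feature-aware recommender

# Template-based explanation: a fixed sentence structure, personalized by slot filling.
template = "You will like this item because of its {} and its {}."
template_explanation = template.format(*selected_features)

# Lexically constrained generation: free-form text in which every keyphrase must appear.
generated_explanation = "A crisp beer with a great taste and a light grapefruit finish."
assert all(phrase in generated_explanation for phrase in selected_features)

print(template_explanation)
print(generated_explanation)
```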
Official Review of Paper252 by Reviewer VGpf
KDD 2023 Conference Research Track Paper252 Reviewer VGpf, 21 Mar 2023 (modified: 23 Mar 2023)

Summary: In this work, the authors present a model called UCEpic, designed to generate high-quality personalized natural language explanations for recommendation results. The model unifies aspect planning and lexical constraints through an insertion-based generation approach. Experimental results demonstrate that the proposed method often outperforms baseline techniques, as measured by both automatic metrics and human annotators.

Paper Strength: Pros:
- The authors have successfully identified and addressed significant shortcomings in existing research, which lends credibility to their proposed approach.
- The writing quality is high, making the paper easy to follow and understand.
- The experiments include several state-of-the-art baselines and an in-depth case study, providing convincing evidence.

Paper Weakness: Cons:
- Section 3 contains some vague statements that require further clarification (see suggestions).
- The improvements over PETER are not substantial, particularly on the Yelp dataset. This might be related to the number of aspects being considered.
- The number of aspects in both datasets is limited, and the proposed method seems to suffer a performance decline when dealing with data containing more aspects. This raises questions about its effectiveness. Further analysis and experimentation are needed to address this issue.

Questions To Authors And Suggestions For Rebuttal:
- In the introduction to Section 3, please explain the variable "w" in greater detail. Lines 449-451, which define w^r, w^a, and w, are unclear. Providing an example would help clarify these concepts.
- Consider masking and deleting essential keywords instead of doing it randomly, to improve efficiency.
- Although the use of general pre-training is understandable, it might be more efficient to utilize pre-trained checkpoints from Hugging Face.
- The design of the two starting stages is not entirely clear. Please provide further explanation to help readers understand this aspect of the model.

Technical Quality: 2: Fair - The proposed approach appears reasonable; however, certain fundamental assertions lack sufficient justification.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 3: The reviewer is fairly confident that the evaluation is correct.

Response to Reviewer VGpf
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 13 Apr 2023

Comment: Thank you for taking the time to review our paper and providing us with your valuable feedback.

For statements in Section 3: Thanks for pointing this out. In Section 3 of our manuscript, we use the variable w to represent tokens, including both normal tokens and aspect IDs. The variables w^r, w^a, and w represent tokens from the reference, aspect, and insertion stages, respectively. To make this clearer, we will provide an example in the revised version of our paper and add an explanation of w to the notation table (Table 2).

For the improvement compared to PETER: Our primary focus is on lexical constraints, and both automatic and human evaluations demonstrate significant improvements of our method over PETER. As indicated in Table 5, when given the same lexical constraints, our method achieves relative improvements over PETER of more than 28.48% across all n-gram metrics (B, M, R) and more than 4.52% on the BS metric, while showing comparable results (or slightly worse results on a few metrics) in the aspect planning setting. We place greater emphasis on the human evaluation results, as they provide a more convincing measure of generation quality. As shown in Figure 5, our method outperforms PETER by a large margin in the human evaluation, highlighting the superiority of our approach.

For the issues of aspects: We determine the number of aspects automatically using the topic mining tool (i.e., [16]) discussed in Section 4.1. This setting is commonly used in previous explanation generation works, such as Ref2Seq. In addition, we have provided detailed explanations in our response to Reviewer g7bD regarding the distinction between aspects and lexical constraints, which may help clarify why the number of aspects in our experiments is not particularly large. We are uncertain whether the conclusion that "performance declines when dealing with data containing more aspects" is based on the comparison between the RateBeer and Yelp datasets. We are concerned that drawing conclusions from comparisons across different datasets may not be entirely reliable due to numerous confounding factors. Additionally, our human evaluation on the Yelp dataset, which contains many aspects, has demonstrated a significant improvement in our performance.

For masking and deleting essential keywords: Thanks for your suggestion regarding masking or deleting keywords non-randomly. We have considered this idea and observed that POINTER is a representative work following this approach. However, our experiments show that POINTER has robustness issues with varied input keyphrases: some input keyphrases may not be "essential" during training and thus become "out-of-distribution" at test time. Therefore, we designed a simple-but-effective random masking strategy for our UCEpic.

For the explanation of the two starting stages: Aspect planning is a classic starting stage described in many related works [20, 21, 29, 31] in our reference section for generating explanations with multiple aspects. Lexically constrained generation is used in controllable generation, and we have highlighted its potential use cases in our introduction (lines 114-138). We will further clarify these points in our revision.

For other comments: We will clarify in our paper that we do use the pre-trained RoBERTa-base checkpoint from Hugging Face to start our model pre-training (following POINTER). However, we decided to further pre-train our insertion-based models because most models on Hugging Face are not specifically pre-trained for robust insertion-based generation, which limits the generation quality on our task. We understand that pre-training is not easy, so we are willing to contribute to the research community by releasing our pre-trained checkpoints to Hugging Face upon acceptance.
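To make the checkpoint initialization and the random masking idea above concrete, below is a minimal, illustrative sketch (not our actual pre-training code; the masking ratio and the way the training pair is formed are simplified assumptions) of starting from the Hugging Face roberta-base checkpoint and building one randomly masked training pair of the kind an insertion-based generator can be trained on:

```python
# Illustrative sketch only: start from the public roberta-base checkpoint and
# randomly drop tokens so the kept tokens act like "keyphrases" the model must
# expand back into the full sentence by insertion. The masking ratio and pair
# construction here are simplified assumptions, not the UCEpic training recipe.
import random

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")  # starting checkpoint

sentence = "A crisp beer with a light grapefruit finish and a smooth taste."
tokens = tokenizer.tokenize(sentence)

random.seed(0)
kept_tokens = [t for t in tokens if random.random() > 0.5]  # randomly kept "keyphrase" tokens

print("full target :", tokens)
print("kept input  :", kept_tokens)
```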
Thanks
KDD 2023 Conference Research Track Paper252 Reviewer VGpf, 18 Apr 2023

Comment: Thanks for the authors' detailed response. Many of my concerns have been resolved.

Thanks
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 19 Apr 2023

Comment: We are glad to know the response addresses many of your concerns. Thanks for your feedback!

Official Review of Paper252 by Reviewer mcd7
KDD 2023 Conference Research Track Paper252 Reviewer mcd7, 19 Mar 2023 (modified: 17 Apr 2023)

Summary: The authors propose UCEpic for generating explanations for recommendations by unifying aspect planning and lexical constraints in an insertion-based generation manner. The proposed method shows superior results compared with existing explanation generation baselines in experiments and a human evaluation. However, it is somewhat unclear what the incremental contribution is compared with existing text generation methods with lexical constraints. There are also some missing technical details and some issues around the robustness of the experimental results. I would appreciate it if the authors could further clarify these points to strengthen the paper.

Paper Strength: The paper describes in detail the proposed algorithm and each component/module therein. In the introduction, the authors clearly state the limitations of existing explanation generation methods, which motivates the proposal of their method. The authors compared with multiple baseline methods and also conducted a human evaluation via MTurk, which I appreciate very much.

Paper Weakness: Methodological contribution: (1) I would appreciate it if the authors could further clarify the methodological contribution of their proposed method. For example, Section 2 describes related work on lexically constrained text generation. I would encourage the authors to state clearly in the paper what UCEpic contributes beyond those works. (2) The proposed method assumes both aspects and lexical constraints are given, while some baseline methods do not have access to that information. Is the comparison fair in this case? Also, it might be infeasible in practice to assume lexical constraints are given. Can the authors comment on how UCEpic can be used in practice when the lexical constraints are not given?

Missing technical details: (1) What language model (e.g., BERT, GPT, etc.) was used in the experiments and how big is it (number of parameters)? Are the results consistent across language models of different sizes? (2) None of the metrics reported in the figures and tables have a confidence interval or standard deviation. Were the results based on a single run? If so, given the randomness in the proposed algorithm, I would suggest running the experiment multiple times and reporting the average with standard deviation to showcase the robustness of the proposed method.

Questions To Authors And Suggestions For Rebuttal: See the "Paper Weakness" section. I would appreciate it if the authors could clarify those points during the rebuttal.
Technical Quality: 2: Fair - The proposed approach appears reasonable; however, certain fundamental assertions lack sufficient justification.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 2: Fair - The study may be of relevance within a specific research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 2: The reviewer is willing to defend the evaluation, but it is quite likely that the reviewer did not understand central parts of the paper.

Response to Reviewer mcd7
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 13 Apr 2023

Comment: Thanks for your insightful comments on our paper! We are happy to address your concerns below.

For methodological contribution: As demonstrated in Table 1, our UCEpic is not a replication of previous insertion-based generators. In contrast to general-purpose insertion-based generators, our model enables personalized generation and aspect planning through the methods proposed in our personalized fine-tuning stage, and it also supports a robust insertion pre-training stage for random keyphrase inputs, where many insertion-based baselines fail. Moreover, we argue that the insight of being the first work to apply insertion-based generators to recommendation explanation generation is itself non-trivial.

For fair comparison: We kept the inputs identical for all models (i.e., all baselines and our models) in the two experimental settings, with minimal adjustments to the baseline implementations. For example, in the lexical constraints setting, we give identical lexical constraints (i.e., keyphrases) as inputs to all baselines (e.g., PETER) and to our model, showing the effectiveness of our approach in a fair setting. We will revise the statement in lines 614-618 to provide more clarity on this matter.

For practical use cases: Thanks for this question. UCEpic is an explanation generator conditioned on user references, aspects, and lexical constraints. First, even without lexical constraints, UCEpic is capable of generating high-quality explanations, as shown in Table 5 where only aspect planning was used as input. Second, we argue that the use of lexical constraints is practical. For example, UCEpic can be used as a downstream model that takes important keyphrases from many easy-to-access sources, such as feature-aware recommendation models (e.g., HFT, from McAuley, Julian, and Jure Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text," Proceedings of the 7th ACM Conference on Recommender Systems). We also list more practical use cases of lexical constraints in lines 134-138.

For technical details: Thanks for your suggestions! Regarding the model size, we used a size comparable to POINTER, and we will update the comparison table as below to reflect this. We want to emphasize that insertion-based models usually rely on pre-trained bi-directional models such as BERT with hundreds of millions of parameters, making them relatively larger than some auto-regressive models. However, the parameter size of our UCEpic model is smaller than that of the other state-of-the-art insertion-based baselines in our experiments, which we will make clear in our revision.

Model | #Parameters | Generation Style | Backbone
Ref2Seq | ~31M | Auto-regressive | GRU
PETER | ~35M | Auto-regressive | Non-pre-trained Transformer
POINTER | ~340M | Insertion-based | Pre-trained BERT-large
CBART | ~410M | Insertion-based | Pre-trained BART-large
UCEpic | ~130M | Insertion-based | Pre-trained RoBERTa-base

Thanks for suggesting multiple runs. We acknowledge the importance of conducting multiple runs to ensure the robustness of our results and will include the standard deviations of our experiments in our revision. In the current submission, our proposed methods consistently outperform the other baselines by a significant margin, indicating that the improvements are unlikely to be due to randomness.

Post Rebuttal
KDD 2023 Conference Research Track Paper252 Reviewer mcd7, 17 Apr 2023

Comment: I appreciate the authors' efforts to provide additional details of the model used and clarifications of the contribution. I have increased my score. However, regarding the last point on reporting confidence intervals, I think it is still necessary to include the standard deviation of the results: the language model itself has randomness in the generated response (what temperature was used?), so it would be great if the authors could present results from different runs in the final version of the paper.

Response
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 19 Apr 2023

Comment: Thanks for your feedback! We will include standard deviations in our final version. As for the decoding strategies, we ran the baselines with the decoding strategies (e.g., beam search, top-p) suggested by their authors or taken from the original implementations. For our model, we simply used greedy decoding (i.e., selecting the predicted tokens by argmax), where the temperature can be treated as 0. We are glad to include these details in the final version of our paper. Thanks again!

Official Review of Paper252 by Reviewer g7bD
KDD 2023 Conference Research Track Paper252 Reviewer g7bD, 22 Feb 2023

Summary: In this paper, the authors propose a novel model, UCEpic, that can generate high-quality personalized explanations for recommendation results by unifying aspect planning and lexical constraints in an insertion-based generation manner. By doing so, the proposed model can generate specific aspect information correctly and becomes more convincing as a result. The authors demonstrate the advantages of their proposed model through offline experiments on two recommendation datasets and also present a case study to evaluate the quality of the generated explanations.

Paper Strength: (1) The introduction and literature review sections are well written: it is easy (even for researchers outside this field) to understand the research question and the proposed solution. The citations included in this paper are appropriate and recent.
(2) The research problem of generating explanations for recommendations is timely and important: multiple studies have demonstrated the importance of such explanations for making recommendations more useful. The model (aspect planning plus lexical constraints) is technically sound, produces good performance, and has the potential to be applied to other applications as well. (3) I especially commend the researchers' efforts to include a case study and a human evaluation in the experiments.

Paper Weakness: There are many missing pieces in the experiment section which I wish the authors could clarify:

(1) What are the ground-truth explanations in the offline datasets? If the authors choose to use the user reviews, that is unfortunately not a good substitute for recommendation explanations, as product information and the recommendation process are not explicitly taken into account.

(2) How are lexical constraints constructed in the offline datasets? Are they equivalent to the aspect labels? Why should we believe that EVERY aspect dimension should appear in the generated explanation, rather than only the most important, selective ones?

(3) As the two offline datasets are rather small and the authors mention heavy computational costs in the Appendix, the scalability concern needs to be addressed. Furthermore, the authors should compare the required training time with the baseline models: it is not meaningful if the proposed model takes a significantly longer time to run.

(4) More information needs to be provided in the human evaluation part: how were the participants recruited, what are their demographics, have they written reviews on online platforms before, what are their educational backgrounds, etc. All of these factors would affect the outcome of the evaluation, and the authors are strongly encouraged to conduct a proper analysis rather than presenting numbers only at the aggregate level.

(5) BLEU and ROUGE have been shown to be unreliable metrics for text generation tasks. The authors might want to consider other evaluation metrics, such as those in [1].
[1] Celikyilmaz, A., Clark, E. and Gao, J., 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.

(6) A minor point: the idea of this paper and the model structure resemble those of CopyNet [2], and I wonder if the authors wish to include it as one of the baseline models (or, if not, comment on the reasons for superiority over CopyNet).
[2] Gu, J., Lu, Z., Li, H. and Li, V.O., 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1631-1640).

Questions To Authors And Suggestions For Rebuttal: Please find my questions in the "Paper Weakness" section. It is overall a nicely written paper, but the experiment section needs substantial clarification.

Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 2: Fair - The study may be of relevance within a specific research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 4: The reviewer is confident but not absolutely certain that the evaluation is correct.

Response to Reviewer g7bD
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 13 Apr 2023

Comment: Thank you for taking the time to make constructive comments on our paper.

For ground-truth explanations: We agree that harvesting explanations from reviews is not perfect. However, given the difficulty of obtaining real natural language explanations, reviews have been widely adopted as a surrogate for explanations in many previous works, e.g., PETER and Ref2Seq. In our experiments, we found that the explanations extracted from reviews often contain useful product information and other details that were difficult to generate directly with previous models. This observation could be dataset dependent, but it shows that the Yelp and RateBeer datasets are suitable for our task.

For lexical constraints: We will provide a clearer explanation of the difference between aspects and lexical constraints and of how we construct them. Aspects are "topic words", e.g., "taste" or "price", which are used to guide explanation generation within specific topics, as in Ref2Seq; lexical constraints are specific keyphrases that the generated sentences must contain, e.g., "light grapefruit finish". Typically, the number of aspects is much smaller than the number of lexical constraints, and aspects are more high-level. To construct our aspects and lexical constraints, we use a state-of-the-art phrase mining tool ([16] in our references) to extract meaningful phrases from the explanation corpus and treat them as lexical constraints. The tool then automatically determines the number of clusters for all extracted phrases, and the clusters are treated as aspects. We will include more details in our revision.

Thanks for the discussion on whether every aspect/keyphrase is important. We do not assume all aspects or lexical constraints are equally important. Therefore, we (1) investigated different scenarios using different numbers of lexical constraints in Figure 3, and (2) mentioned that our UCEpic model can potentially be used as a tool for users to freely query explanations using their aspects or keyphrases of interest (lines 135-138).

For the training time of our models: We clarify that the computational cost of our model is mainly incurred during the pre-training phase, which is reasonable because (1) pre-training in NLP is typically much heavier than fine-tuning, and all pre-trained models in our comparisons, e.g., POINTER and CBART, are similar to or even heavier than ours in their pre-training phases (the model size table in our reply to Reviewer mcd7 also indicates this); (2) UCEpic is pre-trained only once and can then be fine-tuned on explanation generation datasets from different domains, which is a fast process; for instance, we fine-tuned UCEpic on the Yelp dataset within 2.5 hours in the GPU environment described in our paper, making it practical for real-world usage (we will provide more details on this in our revision); and (3) our pre-trained models will be publicly released upon acceptance on, e.g., Hugging Face, allowing researchers to fine-tune their models from our pre-trained checkpoint without high computational costs.

For the details of human evaluation: Thank you for your feedback on our human evaluation. We conducted a standard human evaluation process on MTurk, a popular crowd-sourcing platform that hires many workers for annotation and similar tasks, where we published our surveys for workers. Our only requirement was that workers use English as their first language. We did not collect information on demographics, educational background, or review-writing experience, as we believe anyone can provide valuable feedback in our task regardless of their background. We still appreciate your suggestion to conduct further analysis based on worker groups, which would be intriguing.

For BLEU and ROUGE: Thanks for the suggestion on evaluation metrics. BLEU and ROUGE are commonly used n-gram metrics for text generation in many papers [7, 20, 21, 31], so we use them to reflect generation quality to some extent. However, we agree that these metrics have limitations, so BERTScore (denoted as BS) is also used in our evaluation (it falls in the category of BERT-based evaluation in the survey you suggested) and also shows the better performance of our method. Moreover, we emphasize the importance of human evaluation over all automatic metrics, as it provides more reliable measurements, and there our method greatly outperforms the baselines.

For CopyNet: Thanks for mentioning the interesting work CopyNet, whose copy mechanism can also be considered a way to impose some lexical constraints. The superiority of our work over CopyNet is that (1) the copy mechanism in CopyNet usually selects a subset of words from the inputs according to scores from the "copy mode", which does not ensure that all keyphrases are included; and (2) CopyNet is based on RNNs, which are generally less powerful than pre-trained transformer-based models, leading to lower generation quality. We appreciate the suggestion and will discuss CopyNet in our related work section.

Thanks
KDD 2023 Conference Research Track Paper252 Reviewer g7bD, 17 Apr 2023

Comment: I commend the authors on their efforts to clarify my concerns. I acknowledge that I have read the authors' response, and my overall assessment remains unchanged.

Thanks
KDD 2023 Conference Research Track Paper252 Authors (Zhankui He), 19 Apr 2023

Comment: We are grateful for your commitment to reviewing our paper! Thanks!