Official Review of Paper253 by Reviewer YNSv
KDD 2023 Conference Research Track Paper253 Reviewer YNSv
27 Mar 2023, KDD 2023 Conference Research Track Paper253 Official Review
Readers: Program Chairs, Paper253 Area Chairs, Paper253 Reviewers Submitted, Paper253 Authors

Summary:
In this paper, the authors present a novel framework, named Recformer, which can effectively learn language representations for sequential recommendation. As the authors mention in their manuscript, the paper makes multiple contributions.

Paper Strength:
- The authors formulate items as key-value attribute pairs for ID-free sequential recommendation and propose a novel bi-directional Transformer structure to encode sequences of key-value pairs.
- The authors design a learning framework that helps the model learn users' preferences, recommend items based on language representations, and transfer knowledge to different recommendation domains and cold-start items.
- Extensive experiments are conducted to show the effectiveness of Recformer.

Paper Weakness:
- The short description of the dataset should come with some explanation.
- The description of several variants in the ablation study is not very clear, especially Variants (2), (3), and (4). For Variant (6), I think separate ablations of the item position embedding and the token type embedding would better explain the impact of these two components on the model.
- In "Pre-training Steps vs. Performance.", I noticed that when the pre-training step is 0, the NDCG@10 and Recall@10 on Scientific are inconsistent with the results on the other datasets; there should be some explanation for this.

Questions To Authors And Suggestions For Rebuttal:
(1) I find the introduction's description of the limitations of existing work, the challenges, and the corresponding solutions a little confusing; the correspondence among limitations, challenges, and solutions is not clearly laid out.
(2) I also have some confusion about the design of the two-stage finetuning. The authors claim that in-batch negatives cannot provide accurate supervision on a small dataset because false negatives are likely and undermine recommendation performance, so they propose two-stage finetuning. But I am confused about how the two-stage finetuning solves this problem (both theoretically and experimentally).

Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 4: The reviewer is confident but not absolutely certain that the evaluation is correct.

Response to Reviewer YNSv
KDD 2023 Conference Research Track Paper253 Authors (Jiacheng Li)
13 Apr 2023, KDD 2023 Conference Research Track Paper253 Official Comment

Comment:
Weakness 1 and 2: Thanks for your suggestions to make the paper clearer and easier to understand. We will modify it accordingly in the revised version.
Weakness 3: For Figure 6 (Pre-training Steps vs. Performance), we show the empirical results, and the results on the Scientific dataset differ from the others. We think the likely reason is that textual similarity is highly aligned with item similarity in the Scientific dataset. Hence, even without further pre-training on recommendation datasets, pre-trained language models can already provide a good starting representation for recommendation. We will add an analysis of this observation.
Question 1: Please see the general response for the rearranged limitations, challenges, and contributions. We will improve our introduction accordingly.
Question 2: During pre-training, we use representations from the model for contrastive learning with in-batch negatives. In contrast, our finetuning employs the item embedding matrix to provide positives and negatives for contrastive learning, in which we can use the ground-truth next item as the positive instance and all other items as negative instances. Hence, training instances will not contain false negatives during finetuning. However, we find that traditional finetuning methods (see Table 4 and the ablation study) cannot achieve the best performance for recommendation. Therefore, we propose two-stage finetuning.

Official Review of Paper253 by Reviewer Wqu7
KDD 2023 Conference Research Track Paper253 Reviewer Wqu7
25 Mar 2023, KDD 2023 Conference Research Track Paper253 Official Review

Summary:
The authors propose RECFORMER, a novel framework for sequential recommendation that effectively learns language representations. RECFORMER formulates items as key-value attribute pairs and flattens these pairs into an item "sentence". To encode item sentences, a bi-directional Transformer model is designed. In addition, to pre-train RECFORMER, masked language modeling (MLM) and contrastive learning tasks are adopted. Beyond the pre-training process, a two-stage finetuning is also conducted. Based on these designs, RECFORMER can effectively recommend the next item based on text representations, and the knowledge learned during training can also be transferred to cold-start items or other recommendation scenarios. Experimental results show that RECFORMER outperforms existing methods in different settings, especially for zero-shot and cold-start item recommendation.
An ablation study is also conducted to evaluate the effectiveness of the proposed components. In conclusion, RECFORMER offers a promising direction for handling cold-start item understanding issues in sequential recommender systems. The contributions of the paper are listed below:
- It integrates MLM and contrastive learning in the same framework and uses two-stage fine-tuning to achieve excellent performance of natural language techniques in sequential recommender systems.
- It regards item information as "sentences" and encodes them with the proposed RECFORMER model, which can learn language representations and effectively transfer and generalize to new recommendation scenarios.
- The overall structure of the framework is innovative, and extensive experiments demonstrate the effectiveness of the method in improving knowledge transfer to a large extent, as shown by the zero-shot and cold-start settings.

Paper Strength:
I would like to highlight the following key points addressed in this work:
- The overall design of RECFORMER is interesting. For instance, RECFORMER incorporates contrastive learning into the pre-training for sequential recommendation and integrates MLM to improve item embeddings. This design considers both language understanding and item recommendation. Furthermore, this approach allows adjusting parameters to determine which factor should be given higher weight.
- The experimental results demonstrate exceptional performance in both cold-start and zero-shot scenarios. By encoding item sentences with MLM and contrastive learning techniques, RECFORMER is able to effectively transfer knowledge when unknown new items are introduced. While some existing works apply contrastive learning to cold-start problems, the authors' combination of the contrastive task with a language model is an interesting and promising concept for sequential recommendation.
- The experiments are comprehensive and convincing.
Numerous comparisons and extensive ablation studies are conducted to demonstrate the necessity of the techniques employed in RECFORMER. In addition, the thorough analysis of the experimental results offers valuable insights for future research in this area.

Paper Weakness:
- In recent years, related works using collaborative filtering have already employed contrastive learning to address the cold-start problem, such as the paper by Yinwei Wei et al., "Contrastive Learning for Cold-Start Recommendation" (ACM Multimedia 2021), which incorporates contrastive learning with auxiliary information in collaborative filtering to tackle the cold-start issue. The authors may consider referring to this paper and emphasizing the advantages of a sentence-based approach over traditional methods.
- With regard to the cold-start experimental results, the study only compares RECFORMER with SASRec and UniSRec. Although the experiments show that RECFORMER outperforms these two baselines, reference [38] shows that S3-Rec has better cold-start performance than SASRec. To prove the effectiveness of RECFORMER on cold-start problems, the authors should also compare performance with S3-Rec.
- The dataset utilized in this study does not seem to be open-source, and there is no information available about its original source in either this paper or the cited references. Although the paper refers to using Amazon datasets and cites reference [22], that reference does not share this data and only employs a single "Amazon - Clothing" dataset for its experiments. Furthermore, while the number of data samples used for fine-tuning is provided, there is no information about the size of the data used for pre-training. It is recommended to present additional evidence regarding the provenance of the dataset.
Questions To Authors And Suggestions For Rebuttal:
The authors incorporate contrastive learning and MLM techniques in the sequential recommendation scenario, which is a novel and promising concept. The effectiveness of RECFORMER on cold-start problems is also demonstrated in experiments. However, some parts are not very easy to understand, and I hope the authors' answers can help readers understand the content. The questions are listed below:
- Readers may find it challenging to comprehend the pre-training architecture presented in Fig. 3(b), since the main content lacks a specific explanation of this figure. The explanations provided only pertain to the overall framework and Fig. 3(a). The connection between the parameters in Fig. 3(b) and the combination of MLM with contrastive learning is unclear. It would be helpful if the authors could explain this part more thoroughly.
- In Table 2, the Recall@10 of RECFORMER does not achieve the best or second-best result in the "Instruments" category. The authors should explain this experimental result in more detail.
- As a suggestion on techniques, Yuanmeng Yan et al.'s paper "ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer" (ACL 2021) introduces ConSERT, a technique for learning sentence representations using contrastive learning. That paper employs four distinct data augmentation methods (adversarial attack, token shuffling, cutoff, and dropout) to generate different views for contrastive learning. The authors may consider applying these strategies to improve the efficacy of contrastive learning.

Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature.

Response to Reviewer Wqu7
KDD 2023 Conference Research Track Paper253 Authors (Jiacheng Li)
13 Apr 2023, KDD 2023 Conference Research Track Paper253 Official Comment

Comment:
Weakness 1 and 2: Thanks for your suggestions; we will add the model from ACM Multimedia 2021 and S^3-Rec to our cold-start experiments. Please note that UniSRec is considered the SOTA model for cold-start and low-resource recommendation problems; hence we argue that UniSRec is a strong baseline for showing our effectiveness on the cold-start problem.
Weakness 3: You can find the dataset link (http://jmcauley.ucsd.edu/data/amazon/) on page 6 of the cited paper, or go directly to the download page (https://nijianmo.github.io/amazon/index.html). This Amazon data is open-source and widely used in various recommendation works. For the pre-training dataset statistics, we report the overall numbers to save space. We will provide the preprocessing code, and all datasets are available.
Question 1: Explanations of Figure 3(b) can be found in Section 2.3.1. We will elaborate on this part in our revision.
Question 2: This is caused by the relatively lower performance on Instruments compared with the other datasets. We can explain the results from the perspective of dataset analysis in our revision.
Question 3: Thanks for the suggestion. However, we want to point out that the choice of contrastive learning method is out of the scope of our paper.
Our paper focuses on how to unify recommendation and language understanding to benefit sequential recommendation.

Official Review of Paper253 by Reviewer 8uSq
KDD 2023 Conference Research Track Paper253 Reviewer 8uSq
22 Mar 2023 (modified: 22 Mar 2023), KDD 2023 Conference Research Track Paper253 Official Review

Summary:
The paper studies the problem of sequential recommendation and how transferring knowledge from other recommendation domains can improve a model's performance. Many existing methods rely on text information, but they have not succeeded in unifying the training of language models and recommenders. To address this, the paper proposes organizing a user's historical items as an item sentence and using masked token prediction and item-item contrastive learning to pre-train the language model. The model can then be fine-tuned on different downstream datasets using a two-stage framework, resulting in state-of-the-art performance on various tasks in both fine-tuned and zero-shot scenarios.

Paper Strength:
The paper's motivation is clear, and the studied problem is non-trivial. Extracting shared and common representations of users and items from various recommendation scenarios is beneficial for novel applications. The model's promising zero-shot performance demonstrates the effectiveness of the pre-training task. Additionally, the paper is well-organized and easy to follow, and the experiments are comprehensive. Only text information is used in the model, and it would be interesting to explore multi-modality combinations in future work.

Paper Weakness:
I have some questions in terms of the modeling:
- Should the key be included as a token in the sequence? Intuitively, providing a type embedding could also offer varying semantic information for value tokens with different keys.
- It is hard for me to justify the use of token position embeddings, since the key-value pairs are unordered within each item.
- An efficiency analysis is missing.
- The modeling approach relies on key-value construction, which may not be entirely flexible for other sequential recommendation datasets.
- The technical novelty of the pre-training method is limited, since it primarily adapts the language model and combines it with existing SSL methods.

Questions To Authors And Suggestions For Rebuttal: See weaknesses.
Technical Quality: 2: Fair - The proposed approach appears reasonable; however, certain fundamental assertions lack sufficient justification.
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 3: The reviewer is fairly confident that the evaluation is correct.

Updated comments
KDD 2023 Conference Research Track Paper253 Reviewer 8uSq
17 Apr 2023, KDD 2023 Conference Research Track Paper253 Official Comment

Comment:
Thank you for addressing my questions and concerns. I am satisfied with the responses, and I will maintain my original recommendation score.

Response to Reviewer 8uSq
KDD 2023 Conference Research Track Paper253 Authors (Jiacheng Li)
13 Apr 2023, KDD 2023 Conference Research Track Paper253 Official Comment

Comment:
Question 1: A type embedding is a feasible option in our experiments.
However, for better generalization, we expect our method to be applicable in different scenarios where items may contain different attribute keys. Hence, we treat the keys as tokens in the sequences.
Question 2: Key-value pairs are unordered, but in our case attribute values can be short sentences (e.g., a title described in natural language). Hence, we think the use of token position embeddings is still reasonable.
Question 3: Here we give a simple efficiency analysis of SASRec, Recformer, and P5. Specifically, we run inference on the Scientific test set with the three models. The inference speeds are shown below:
- SASRec: 230.34 instances/s
- Recformer: 8.64 instances/s
- P5: 2.76 instances/s
We can see that the traditional sequential recommender SASRec is much faster than recommenders based on language models, because SASRec uses a two-layer Transformer whereas language models usually use a deep Transformer (e.g., 12 layers). Comparing Recformer with P5, we find that Recformer is faster, because P5 uses beam search to generate item ids whereas Recformer produces sequence representations and computes similarities by inner product. We will include time comparisons among different methods in the revision.
Question 4: We agree that key-value attributes may not be applicable in all cases (e.g., items without attributes). However, key-value attributes are a general format for item attributes that can be applied in most cases.
Question 5: We are the first to model sequential recommendation as a language understanding task. To this end, we propose a new model structure (different from language models for text, e.g., item position embeddings) and a two-stage finetuning method for recommendation.
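[Editor's note] The key-as-token flattening discussed in the response above can be illustrated with a minimal sketch. This is not the authors' code: the attribute names and the whitespace tokenization are hypothetical (the actual model uses a subword tokenizer).

```python
def item_to_sentence(attributes):
    """Flatten an item's {key: value} attribute pairs into one flat
    "item sentence" token list. Attribute keys are kept as ordinary
    tokens, so the same model can handle items whose attribute
    schemas differ across domains."""
    tokens = []
    for key, value in attributes.items():
        tokens.append(key)            # the attribute key becomes a token
        tokens.extend(value.split())  # naive whitespace tokenization of the value
    return tokens

# Hypothetical item from a music-instruments domain.
item = {"title": "Student Violin Bow", "brand": "Acme", "category": "Instruments"}
print(item_to_sentence(item))
# ['title', 'Student', 'Violin', 'Bow', 'brand', 'Acme', 'category', 'Instruments']
```

Because the keys travel with their values as tokens, an item with a different schema (say, `{"author": ..., "publisher": ...}`) needs no model change, which is the generalization argument made in the response.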
Official Review of Paper253 by Reviewer BGtm
KDD 2023 Conference Research Track Paper253 Reviewer BGtm
21 Mar 2023 (modified: 20 Apr 2023), KDD 2023 Conference Research Track Paper253 Official Review

Summary:
The paper proposes a sequential recommendation model that utilizes text-based features to solve the cold-start problem in recommendation. The paper aims to address three challenges: 1) NLP pre-trained language models differ from the domain of RS; 2) how to model text representations in sequential recommendation; and 3) the independent training of PLMs and recommendation models. Utilizing text-based information in RS is an interesting question, especially in the era of rapid LLM development. However, the technical improvement in the paper is incremental, and the experiments do not fit the motivation well.

Paper Strength:
(1) The paper proposes to utilize text-based information to enhance sequential recommendation, which is an interesting research direction.
(2) The paper proposes a common framework for models such as BERT and Longformer.
(3) The experimental results show that the proposed method can outperform the baselines.

Paper Weakness:
(1) The paper claims that a universal input data format of items is proposed for language by adding the token embeddings to the PLM. However, both [1] and [2] mention how to add such embeddings into LM modules. Therefore, the additional technical contributions are incremental. It would be better if the authors discussed the differences between the discriminative models and the language models on this topic. More experiments comparing against [1] and other non-text transfer learning methods are also expected.
(2) The motivation is to utilize text-based information to solve the "cold-start" problem in RS. However, most of the experiments are based on general sequential recommendation settings.
It would be more convincing if the authors conducted experiments that align well with the motivation. To my understanding, Table 3 and Figure 4 report the major experimental results of all the methods (including baselines) and datasets.
(3) Minor issues: I note that different experimental analyses were conducted on different datasets, and I am curious about the reasons. It would also be better to conduct significance tests on the improvements, as the improvements are not large.

[1] Geng, S., Liu, S., Fu, Z., Ge, Y., & Zhang, Y. (2022). Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). In Proceedings of the 16th ACM Conference on Recommender Systems (pp. 299-315).
[2] Cui, Z., Ma, J., Zhou, C., Zhou, J., & Yang, H. (2022). M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems. arXiv preprint arXiv:2205.08084.

Questions To Authors And Suggestions For Rebuttal:
(1) Please give more discussion of the additional technical contributions of the proposed method relative to the LLM-based recommendation model in [1].
(2) I note that different experimental analyses were conducted on different datasets. I am curious about the reasons.
(3) Some improvements shown in Table 3 and Table 4 are marginal. Are the improvements significant?
(4) From Table 4, it seems that the two-stage fine-tuning achieves limited improvement, which does not fit well with the claim.
(5) It would be better to show the inference time of the proposed method, since the sequences are long in these settings.
The author responses addressed most of my questions.
Technical Quality: 3: Good - The approach is generally solid with minimal adjustments required for claims that will not significantly affect the primary results
Presentation: 3: Good - The paper clearly explains the technical parts, but the writing could be improved to better understand its contributions.
Contributions: 3: Good - Could help ongoing research in a broader research community.
Overall Assessment: 3: Borderline Accept: The paper is technically solid but has limitations. However, the reasons to accept outweigh the reasons to reject.
Confidence Level: 3: The reviewer is fairly confident that the evaluation is correct.

Response to Reviewer BGtm
KDD 2023 Conference Research Track Paper253 Authors (Jiacheng Li)
13 Apr 2023, KDD 2023 Conference Research Track Paper253 Official Comment

Comment:
Weakness 1 and Question 1: We want to argue that our method makes significant contributions compared to M6-Rec and P5, and to explain why we did not compare against these two methods.
- We have a different study scope from M6-Rec and P5: these two methods focus on unifying different recommendation tasks, whereas we focus on sequential recommendation with language models.
- The M6-Rec authors do not publish their code, and the paper is not published in any journal or conference.
- We tried to compare P5 to our work. However, we could not reproduce the results reported in their paper when using shuffled ids. Other researchers have similar concerns: https://github.com/jeykigung/P5/pull/3
- Although P5 uses a language model for recommendation, it still uses item ids as item embeddings and does not involve text representations of items. We think there is a significant difference between our method and P5.
Weakness 2: "Cold-start" is just one of our motivations. The main problem studied in our paper is transfer learning in sequential recommendation, which is an important problem mentioned in previous works. Using text representations, we can have transferable training for sequential recommendation, and the general sequential recommendation experiments validate the transfer learning results of our method.
Weakness 3 (1) and Question 2: Every experimental analysis includes at least the same Scientific and Instruments datasets for comparison to validate our claims. We did not include all datasets due to page limitations.
Weakness 3 (2) and Question 3: For Table 3, our method has large improvements on three datasets, which shows its effectiveness on the cold-start problem. For Table 4, the ablation study adjusts some minor components and hence yields only marginal changes in performance.
Question 4: Two-stage finetuning achieves non-trivial improvements on the Instruments dataset and consistent improvements on both datasets.
Question 5: Here we give a simple efficiency analysis of SASRec, Recformer, and P5. Specifically, we run inference on the Scientific dataset with the three models. The inference speeds are shown below:
- SASRec: 230.34 instances/s
- Recformer: 8.64 instances/s
- P5: 2.76 instances/s
We can see that the traditional sequential recommender SASRec is much faster than recommenders based on language models, because SASRec uses a two-layer Transformer whereas language models usually use a deep Transformer (e.g., 12 layers). Comparing Recformer with P5, we find that Recformer is faster, because P5 uses beam search to generate item ids whereas Recformer produces sequence representations and computes similarities by inner product.

Response to author rebuttal
KDD 2023 Conference Research Track Paper253 Reviewer BGtm
19 Apr 2023, KDD 2023 Conference Research Track Paper253 Official Comment

Comment:
I have reviewed the author responses; thanks a lot for your active responses.
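[Editor's note] The inner-product scoring that the authors' efficiency analyses contrast with beam search can be sketched as follows. This is an illustrative example, not the paper's implementation; the shapes, data, and function name are made up.

```python
import numpy as np

def recommend_top_k(seq_repr, item_matrix, k=10):
    """Rank items by inner product between one sequence (user)
    representation of shape (d,) and a precomputed item representation
    matrix of shape (n_items, d); return indices of the top-k items.
    A single matrix-vector product scores every item at once, which is
    why this is cheaper than generating item ids autoregressively with
    beam search."""
    scores = item_matrix @ seq_repr      # (n_items,) similarity scores
    return np.argsort(-scores)[:k]       # item indices by descending score

# Hypothetical catalog: 1000 items with 64-dimensional representations.
rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 64))
user = rng.normal(size=64)
print(recommend_top_k(user, items, k=5))
```

Because `item_matrix` can be precomputed once per catalog, each inference call costs one matrix-vector product plus a partial sort, independent of the language model's depth.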