Reviewer(s)' Comments to Author:

Reviewer: 1
Recommendation: Accept
Comments: The authors have solved the reproducibility problem. Their comments on FRBO make sense, but when considering RBO@20 rather than all items, the situation changes. I remain of the opinion that FRBO should be included in the camera-ready version. I recommend that the paper be accepted.

Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?:
In what ways does this paper advance the field?:
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5):
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5):
Please explain why.:
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5):
Rate the overall quality of the writing (very poor = 1, excellent = 5):
Does this paper cite and use appropriate references?:
If not, what important references are missing?:
Is the treatment of the subject complete?:
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: None
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Yes

Reviewer: 2
Recommendation: Accept
Comments: The revised version of the manuscript has been improved compared to the previous one, and all the questions raised in my review have been answered.
Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?: Yes
In what ways does this paper advance the field?: See my previous review
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 2
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 4
Please explain why.: See my previous review
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 3
Rate the overall quality of the writing (very poor = 1, excellent = 5): 3
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: Yes
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Moderate
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Yes

Reviewer: 3
Recommendation: Accept
Comments: This version effectively addresses most of my previous concerns regarding the baseline and the non-SOTA nature of the recommender system used. Specifically, the authors have replaced the earlier LSTM-based approach with the LRURec recommender, proposed in 2024, which is more aligned with current advancements in the field. Additionally, the authors have provided detailed explanations in the experimental setting, clarifying why many baselines could not be used in this context: some baselines require multimodal data inputs, while others necessitate training from scratch, which does not fit the fine-tuning framework in which FINEST operates. Moreover, the method has proven effective through extensive experiments, demonstrating substantial improvements in the evaluation metrics.
Additionally, the ablation studies in Table 5 validate the necessity of each component within the FINEST framework, reinforcing its practical value in real-world applications.

Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?: Yes
In what ways does this paper advance the field?: The paper introduces FINEST, a fine-tuning method that stabilizes recommender systems by preserving the rank order of recommendations. It addresses the critical issue of recommendation instability under data perturbations, making it a significant contribution to the field. FINEST simulates real-world perturbations using methods such as Random, Earliest-Random, and CASPER, ensuring that the system can maintain consistent recommendations even when minor changes occur in the input data and thereby enhancing the robustness of recommender systems.
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 5
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 4
Please explain why.: The paper has been revised, and the current version is easy to understand.
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 4
Rate the overall quality of the writing (very poor = 1, excellent = 5): 4
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: Yes
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Light
Most ACM journal papers are researcher-oriented.
Is this paper of potential interest to developers and engineers?: Yes

==================================================

Reviewer(s)' Comments to Author:

Reviewer: 1
Recommendation: Accept
Comments: The revised version of the manuscript has been improved compared to the authors' initial submission, and all the issues I highlighted in my review have been addressed.

Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?: Yes
In what ways does this paper advance the field?: See my first review
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 2
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 3
Please explain why.:
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 3
Rate the overall quality of the writing (very poor = 1, excellent = 5): 4
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: Yes
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Moderate
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Yes

Reviewer: 2
Recommendation: Major Revision
Comments: Thank you for your kind reply and clarification.

2.5 --> Having a range within [0, 1] and attaining the value 1 when two finite lists are identical are very different properties. In fact, with the p = 0.9 value recommended in your paper (which is also the most widely used in the community), the RBO of two identical lists of length 20, computed with the rbo library (the same one you use), is 0.88. With other implementations, the value will likewise always be < 1.
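Reviewer 2's figure is easy to verify with a few lines of Python. The sketch below implements the standard truncated RBO formula, RBO@k = (1 - p) * sum_{d=1..k} p^(d-1) * A_d, plus one common finite-list rescaling (dividing by the maximum attainable value, 1 - p^k). The function names `rbo_at_k` and `frbo_at_k` are illustrative, not the rbo library's API, and the rescaling is an assumption about FRBO's intent; the exact FRBO definition is the one given in [6].

```python
def rbo_at_k(list_a, list_b, p=0.9, k=None):
    """Truncated Rank-Biased Overlap: (1 - p) * sum_{d=1..k} p^(d-1) * A_d,
    where A_d is the fraction of items shared by the two top-d prefixes."""
    k = k or min(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, k + 1):
        agreement = len(set(list_a[:d]) & set(list_b[:d])) / d
        score += (1 - p) * p ** (d - 1) * agreement
    return score


def frbo_at_k(list_a, list_b, p=0.9, k=None):
    """Finite-list rescaling (illustrative reading of FRBO): divide by the
    maximum attainable truncated RBO, 1 - p**k, so that two identical
    length-k lists score exactly 1."""
    k = k or min(len(list_a), len(list_b))
    return rbo_at_k(list_a, list_b, p, k) / (1 - p ** k)


identical = list(range(20))
print(round(rbo_at_k(identical, identical), 2))   # 0.88 -- i.e. 1 - 0.9**20
print(round(frbo_at_k(identical, identical), 2))  # 1.0
```

For two identical length-20 lists with p = 0.9 this reproduces the 0.88 the reviewer reports, which is simply 1 - 0.9^20 ≈ 0.878; the rescaled score reaches 1, as the reviewer argues a finite-list metric should.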
According to paper [6], the authors normalise the RBO between 0 and 1, so it is necessary to use FRBO. The analysis of the RBO and Jaccard values has been somewhat overlooked: in fact, a high RBO value combined with a low Jaccard value can indicate that...

Finally, a negative note: I have tested your code with almost all implementations and different datasets on 2 different machines, and the results obtained do not match those of the article, leading to different conclusions. It is essential to provide the functional code actually used; moreover, it is crucial to set the seeds to allow 100% reproducibility on every machine. Please resolve these issues to ensure the reproducibility of your work.

Additional Questions:
Reviewer's recommendation for paper type:
Does the paper present innovative ideas or material?:
In what ways does this paper advance the field?:
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5):
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5):
Please explain why.:
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5):
Rate the overall quality of the writing (very poor = 1, excellent = 5):
Does this paper cite and use appropriate references?:
If not, what important references are missing?:
Is the treatment of the subject complete?:
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Moderate
Most ACM journal papers are researcher-oriented.
Is this paper of potential interest to developers and engineers?: Yes Reviewer: 3 Recommendation: Accept Comments: Although the primary method used in the paper—Rank-preserving Regularization loss to enhance model robustness against training data perturbations—isn't particularly novel, this version addresses most of our previous concerns regarding the baseline and the non-SOTA nature of the recommender. Moreover, the method has proven to be effective. Compared with the previous version, they've replaced the LSTM with the 2024 proposed LRURec recommender and provided explanations in the experimental setting for why many baselines couldn't be used. Additional Questions: Review's recommendation for paper type: Full length technical paper Does the paper present innovative ideas or material?: Yes In what ways does this paper advance the field?: 1. Introduction of Stability Enhancement Methods: The paper proposes the method of stabilizing recommender systems by introducing FINEST, a novel fine-tuning approach that preserves the rank order of recommendations. 2. Simulation of Real-world Perturbations: FINEST innovatively simulates real-world perturbations by randomly altering user-item interactions in the training data. This step is crucial as it mimics actual user behavior and external noise, providing a robust framework to test and enhance the stability of recommender systems. 3. Rank-preserving Regularization Technique: The introduction of a rank-preserving regularization technique is a significant advancement. By incorporating this regularization during the fine-tuning process, FINEST ensures that the recommendation rankings remain stable despite data perturbations. 4. Comprehensive Evaluation of Real-world Datasets: The paper's extensive experiments on diverse, real-world datasets (such as LastFM, Foursquare, and Reddit) demonstrate the effectiveness of FINEST across various domains. 
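The "simulation of real-world perturbations" in point 2 above can be illustrated with a short sketch. The strategy names come from the review; the exact procedures (especially CASPER) are defined in the paper, so the two helpers below are plausible readings under assumed interfaces, not the authors' implementation.

```python
import random

def perturb_random(sequence, item_pool, rng):
    """Illustrative 'Random' perturbation: replace one randomly chosen
    interaction in a user's sequence with a random item from the catalog."""
    out = list(sequence)
    out[rng.randrange(len(out))] = rng.choice(item_pool)
    return out

def perturb_earliest_random(sequence, item_pool, rng):
    """Illustrative 'Earliest-Random' variant: perturb the earliest
    interaction, whose effect propagates through every later training
    step of a sequential recommender."""
    out = list(sequence)
    out[0] = rng.choice(item_pool)
    return out

# A fixed seed, echoing Reviewer 2's later request for full reproducibility.
rng = random.Random(42)
history = [3, 17, 5, 42, 8]  # one user's item sequence
print(perturb_random(history, list(range(100)), rng))
print(perturb_earliest_random(history, list(range(100)), rng))
```

Passing an explicit seeded `random.Random` instance, rather than relying on the global generator, is one direct way to address the reproducibility complaint raised in the Major Revision review.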
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 4
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 4
Please explain why.: The ideas in this paper are presented in a clear and concise manner, making them easy to understand and follow.
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 3
Rate the overall quality of the writing (very poor = 1, excellent = 5): 4
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: Yes
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Light
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Maybe

==================================================

Reviewer(s)' Comments to Author:

Reviewer: 1
Recommendation: Minor Revision
Comments: In the work titled “FINEST: Stabilizing Recommendations by Rank-Preserving Fine-Tuning”, the authors propose a recommender-systems method that tackles several well-known problems: reference rank lists are obtained from a given recommendation model, which is then tuned under simulated perturbation scenarios with rank-preserving regularization on sampled items. The authors claim that the validation performed on real-world datasets shows the method's effectiveness in ensuring that such a recommender model generates stable recommendations under a wide range of perturbations without compromising next-item prediction accuracy.
The manuscript submitted by the authors appears well written and well organized, although its introductory part is unbalanced: the "Related Work" section is disproportionate both to the "Introduction" section and to the rest of the manuscript. To address this issue, the authors should use the "Introduction" solely to provide a general overview of the research context and then delve into each aspect in the subsequent "Related Work" section. Additionally, the limited number of discussed works is insufficient to give readers adequate information about the domain. The references used are sufficiently up-to-date, although, in light of my previous observation, the authors should discuss further works related to the considered domain to offer readers an adequate overview, such as, by way of example:
(-) Wang, Shoujin, et al. "A survey on session-based recommender systems." ACM Computing Surveys (CSUR) 54.7 (2021): 1-38.
(-) Hu, Linmei, et al. "Graph neural news recommendation with long-term and short-term interest modeling." Information Processing & Management 57.2 (2020): 102142.
(-) Saia, Roberto, Ludovico Boratto, and Salvatore Carta. "Semantic coherence-based user profile modeling in the recommender systems context." International Conference on Knowledge Discovery and Information Retrieval. Vol. 2. SciTePress, 2014.
(-) Sharma, Sunny, Vijay Rana, and Vivek Kumar. "Deep learning based semantic personalized recommendation system." International Journal of Information Management Data Insights 1.2 (2021): 100028.
(-) ... and so on.
In addition, regarding the references used, the authors should avoid citing preprint versions (e.g., arXiv), as these have not undergone a peer-review process and their scientific value is therefore not certified in any way. As for the proposed method, its formalization is clear, and the experimental results appear convincing.
However, it is advisable to separate the discussion of the results from the conclusions, and the scientific contribution should be better highlighted, both in comparison with other literature works and with respect to potential application scenarios.

Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?: Yes
In what ways does this paper advance the field?: In the context of recommender systems, the authors propose a novel approach: a model fine-tuned through simulated perturbation scenarios with rank-preserving regularization on sampled items.
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 4
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 4
Please explain why.: The references used are sufficiently up-to-date, the formalization of the proposed approach is clear, and the experimental results appear convincing.
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 3
Rate the overall quality of the writing (very poor = 1, excellent = 5): 4
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: Yes
If not, what important details/ideas/analyses are missing?:
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Moderate
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Yes

Reviewer: 2
Recommendation: Minor Revision
Comments: First of all, congratulations on your work; it is well done and I like it a lot. I have some points that I think could be improved:
1) Page 2, line 33: what do you mean by "original rank lists"? The ground-truth list, or the one obtained by training the model without training perturbations?
2) Related Work section:
2.1) I think the subsection on "Adversarial Machine Learning" is out of scope. It would be more interesting to expand the subsection on "Robust Recommender Systems", especially by briefly explaining how the methods presented in Table 1 work.
3) Page 4, line 29: I think there is a typo in the formulation of S_u and T_u. Is there a comma missing after the three dots?
4) In the introduction, you mention three papers that cover the topic of training perturbations in the input sequence [6, 42, 72] (of your references), but in Section 3.2 you mention only [42]. Why?
5) In particular, the authors of [6] introduce a new ranking measure, finite-RBO, because standard RBO does not converge to 1 even when applied to identical finite-length lists, which is the case in recommender systems. I think you should include this metric instead of RBO.
6) Continuing with (F)RBO: in Table 3, with respect to which ranking is "original" calculated? What is the ground truth? (This is a bit like question 1.)
7) It is not clear to me whether the other models were also fine-tuned with the element perturbation, or only fine-tuned with your new loss. Can you clarify this point?
7.1) Are the perturbations done for each user, or for one user at each epoch? If the latter, is the user chosen randomly?
8) Dataset section:
8.1) Does Table 2 show the statistics after preprocessing? Could you also add the density and the average sequence length?
8.2) Which Foursquare dataset are you using: New York City or Tokyo?
8.3) I think four citations for each dataset are too many. Please keep just one and add a hyperlink to where the dataset can be downloaded.
8.4) You mention the Reddit dataset, but you never mention it in the results. Please remove it or add the results for it.
9) Why did you choose only models with a specific training scheme? TiSASRec also uses temporal information, while BERT4Rec is trained with cloze tasks.
This is not a criticism; I would simply like to understand the motivation behind this choice. Also, why did you use an LSTM and not another SRS model?
9.1) How is the timestamp of an added item managed? I understand that you use a timestamp just before the added element (page 7, line 34), but perhaps you could discuss this in more detail.
10) Could you also add the NDCG and Precision metrics?
11) Please move Table 3 to where it is mentioned in the text.
12) Page 11, line 35 mentions an appendix, but I think it should be Section 6.7 instead.
13) In Section 6.4 you mention the scalability of FINEST, but is it the scalability of FINEST itself or of the model that uses FINEST?
14) The ablation studies are very interesting, especially Table 5 and Figure 7.
15) My last question is about the real-world application of your method, given that the model has seen all the items in the training data of the base recommender. I cannot think of a real-world scenario where a model sees all the information and then casually perturbs the same data on which it was trained. What am I missing?
16) Regarding the code:
16.1) It is well commented; thanks.
16.2) The variable in line 96 of the LSTM.py file is not used; please remove it.
16.3) I think that, for more clarity, the code could be split into several files, keeping the architecture in architecture.py and the training in train.py. What do you think?
16.4) Why did you put the least-common-element counter outside the training loop? This way, will you not always insert the same element?
16.5) Please also add the other architectures used in your work.

Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?: Yes
In what ways does this paper advance the field?: They present a first approach for fine-tuning sequential recommendation systems to increase their robustness.
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 3
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 3
Please explain why.: Many points could be explained better to improve the quality of the work, as can be seen in the notes to the authors.
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 4
Rate the overall quality of the writing (very poor = 1, excellent = 5): 3
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: No
If not, what important details/ideas/analyses are missing?: See the notes to the authors
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Moderate
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Yes

Reviewer: 3
Recommendation: Minor Revision
Comments: My questions are mainly about the baselines and the experimental settings. Please refer to the comments under "What important details/ideas/analyses are missing?"

Additional Questions:
Reviewer's recommendation for paper type: Full-length technical paper
Does the paper present innovative ideas or material?: Yes
In what ways does this paper advance the field?: The main problem addressed by the article is the vulnerability of recommender systems to minor perturbations in the training data, which can lead to significant inconsistencies in the recommendations provided to users. Such perturbations can arise from various factors, such as a user mistakenly clicking on an item. This problem is meaningful in practice because of:
- Reduced accuracy: Perturbations can decrease the accuracy of the recommendations, making the system less effective at matching user preferences.
- User-experience deterioration: Frequent and unexplained changes in recommendations can frustrate users, leading to decreased engagement and satisfaction.
The paper proposes FINEST, which starts from a base recommender system and enhances its stability through a carefully designed fine-tuning process. Initially, it generates reference ranking lists from the training data using the base model. Then, it introduces pseudo-perturbations to the data by randomly altering interactions, mimicking real-world changes. With this perturbed data, FINEST applies a specialized regularization technique to preserve the ranking order of the recommendations. This process is repeated over several epochs to iteratively improve the model's resilience to data changes, culminating in a fine-tuned recommender system that maintains consistent and accurate recommendations even when faced with perturbations.
Rate how well the ideas are presented (very difficult to understand = 1, very easy to understand = 5): 4
Rate the information in the paper: is it sound, factual, and accurate? (poor = 1, excellent = 5): 4
Please explain why.: The ideas presented in this paper are easily understandable, with clear and concise descriptions that facilitate comprehension.
Rate the paper on its contribution to the body of knowledge in this field (none = 1, very important = 5): 4
Rate the overall quality of the writing (very poor = 1, excellent = 5): 4
Does this paper cite and use appropriate references?: Yes
If not, what important references are missing?:
Is the treatment of the subject complete?: No
If not, what important details/ideas/analyses are missing?: In the paper, the target recommender models discussed are TiSASRec, BERT4Rec, and LSTM, with TiSASRec being the most recent, introduced in 2020. Considering the rapid advancements in the field, the resilience of recommender systems to perturbations may vary with newer models.
Thus, it would be beneficial for the authors to demonstrate FINEST's effectiveness on more contemporary sequential recommender systems developed between 2020 and 2024 to provide a comprehensive understanding of its applicability and performance across different technological generations. Additionally, the two baseline methods introduced in the paper, Adversarial Poisoning Training and ACAE, were proposed in or before 2021. Given the continuous evolution of the field, it is worth exploring whether newer baseline approaches have been developed between 2021 and 2024.
Please help ACM create a more efficient time-to-publication process. Using your best judgment, what amount of copy editing do you think this paper needs?: Light
Most ACM journal papers are researcher-oriented. Is this paper of potential interest to developers and engineers?: Yes
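As a closing illustration of the mechanism the reviews keep returning to, the rank-preserving regularization that Reviewer 3 summarizes, here is a schematic pairwise formulation: a hinge on the score gaps between consecutively ranked items of the base model's reference list. This is a generic sketch, not FINEST's actual loss; `rank_preserving_penalty` and its arguments are hypothetical names.

```python
def rank_preserving_penalty(ref_ranking, scores, margin=0.1):
    """Generic rank-preserving regularizer (sketch): for each pair of
    consecutive items in the reference ranking produced by the base model,
    add a hinge penalty whenever the fine-tuned model fails to keep the
    higher-ranked item's score above the lower-ranked item's by `margin`."""
    penalty = 0.0
    for higher, lower in zip(ref_ranking, ref_ranking[1:]):
        penalty += max(0.0, margin - (scores[higher] - scores[lower]))
    return penalty

reference = ["a", "b", "c"]                # base model's reference order
stable = {"a": 3.0, "b": 2.0, "c": 1.0}    # order preserved -> no penalty
shuffled = {"a": 1.0, "b": 2.0, "c": 3.0}  # order inverted -> penalized
print(rank_preserving_penalty(reference, stable))    # 0.0
print(rank_preserving_penalty(reference, shuffled))  # 2.2
```

Minimizing such a term alongside the usual next-item loss pushes the fine-tuned model to keep the reference ordering even after the training data has been perturbed, which is the stability property the reviews evaluate with RBO.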