AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems
Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Xin Zhao, Leyu Lin, Ji-Rong Wen
Published: 22 Jan 2024 (last modified: 22 Jan 2024), TheWebConf 2024 Conference
Keywords: Agents, Large Language Models, Collaborative Learning
TL;DR: The paper proposes using LLM-powered agents to simulate both users and items in recommender systems, collaboratively optimizing these agents to capture user-item interaction relations.
Abstract: Recently, LLM-powered agents have emerged as believable human proxies, owing to their remarkable decision-making capability. However, existing studies mainly focus on simulating human dialogue. Human non-verbal behaviors, such as item clicking in recommender systems, implicitly exhibit user preferences and could enhance user modeling, yet they have not been deeply explored. The main reason lies in the gap between language modeling and behavior modeling, as well as LLMs' limited understanding of user-item relations. To address this issue, we propose AgentCF, which simulates user-item interactions in recommender systems through agent-based collaborative filtering. We creatively consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together. Specifically, at each time step, we first prompt the user and item agents to interact autonomously. Then, based on the disparities between the agents' decisions and real-world interaction records, user and item agents are prompted to collaboratively reflect on and adjust the misleading simulations, thereby modeling their two-sided relations. The optimized agents can also propagate their preferences to other agents in subsequent interactions, implicitly capturing the collaborative filtering idea.
Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions. The results show that these agents can demonstrate personalized behaviors akin to those of real-world individuals, sparking the development of next-generation user behavior simulation. Code is available at: https://anonymous.4open.science/r/AgentCF-WWW/.
Track: User Modeling and Recommendation
Submission Number: 1299

Paper Decision
Decision by Program Chairs, 21 Jan 2024 (modified: 22 Jan 2024)
Decision: Accept
Comment: The paper introduces AgentCF, an innovative agent-based collaborative filtering approach for simulating user-item interactions in recommender systems. This method creatively leverages the capabilities of Large Language Models (LLMs) to optimize simulated user and item agents, thus enhancing the recommendation process. The proposed approach has been thoroughly evaluated on two datasets and demonstrates its effectiveness through comparison with several baseline methods. The paper's strengths lie in its originality and relevance to current research trends in recommender systems. The dual-agent perspective, treating both users and items as agents, represents a significant departure from traditional models, offering a more comprehensive understanding of user-item dynamics. The incorporation of memory modules for both agents, allowing for the integration of collaborative signals, is an inventive and valuable addition to the field.
Furthermore, the experiments effectively validate the model's efficiency, particularly in sparse data scenarios. There are areas that could be improved for enhanced clarity and robustness. The presentation of the AgentCF framework in Figure 1 could be made clearer, with more detailed explanations to aid comprehension. The paper could benefit from more explicit definitions of the experimental objectives and a deeper discussion of the selection of baseline methods for comparison. The ablation study results should include more details about the different prompting strategies used. Additionally, the paper could further explore the interpretability of the model's decisions and recommendations, which would enhance its applicability and trustworthiness.

Summary of Our Response
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 15 Dec 2023
Comment: We sincerely thank the reviewers for their insightful and constructive feedback. We had great discussions with all the reviewers and made our best efforts to address all the raised concerns. Since the discussion stage is coming to an end, we summarize our responses to the reviewers' concerns here:
- Clarification on collaborative optimization: We provide more clarity on AgentCF's optimization process, especially preference aggregation and propagation, along with a complementary figure. Please see this response, this response, and this response.
- Analysis of memory mechanism design: We describe the proposed memory mechanism. We further explain how memory modules benefit AgentCF in mitigating issues like catastrophic forgetting and limited context length. Please see this response and this response.
- Interpretability of AgentCF: We discuss how LLM-based agents enhance interpretability through language-based memory modules and direct prompting. Please see this response.
- Addressing various questions: We discuss questions regarding dataset sampling, prompting strategies, and memory updates. Please see this response, this response, this response, and this response. We further clarify the experimental setting, baseline methods, and related work. Please see this response.

Official Review of Submission1299 by Reviewer KRAK
Official Review by Reviewer KRAK, 24 Nov 2023 (modified: 14 Dec 2023)
Review: This paper proposes an agent-based collaborative filtering approach for simulating user-item interactions. In the proposed method, the authors leverage the capabilities of LLMs to optimize the simulated user agents and item agents, so that their preferences can be propagated in subsequent interactions. They evaluate the proposed approach on two datasets and compare it with several baseline methods to demonstrate its effectiveness. Pros: The investigated topic of user behavior simulation is interesting and will be of interest to researchers in recommender systems. The proposed model learns the user-item interaction through autonomous interaction between user and item agents, and enables optimization of user and item agents to capture two-sided relations. The evaluation and ablation study results show the proposed approach can simulate user-item interactions, achieving good performance on recommendation tasks. Suggestions for improvements: The presented framework of AgentCF in Figure 1 is a bit confusing to me with the current figure caption. The authors might consider adding more detailed explanations for the figure.
For example, it is unclear how the figure shows "the simulated preferences of user and item agents aggregate (as indicated by the highlighted content) and can propagate to other agents in subsequent interactions." For the experiment part, it would be better to explicitly specify what questions the authors want to answer across the experiments; this would make the rationale behind the experiments easier to understand. For the baseline methods, it would be better to include more explanation of how the baseline methods were chosen for the experiment. For the ablation study, it seems the results only show AgentCF_B and the associated comparison instead of the other two prompting strategies. It would be better to provide some explanation in the paper. For the related work section, it might be better to add some discussion of existing work on user behavior simulation. Questions: For the baseline methods, I wonder how the baseline methods were chosen for the experiment, or the rationale behind the choice. For the ablation study, it seems the results only show AgentCF_B and the associated comparison instead of the other two prompting strategies. I wonder about the reason for this selection.
Ethics Review Flag: No
Ethics Review Description: No
Scope: 3: The work is somewhat relevant to the Web and to the track, and is of narrow interest to a sub-community
Novelty: 5
Technical Quality: 6
Reviewer Confidence: 2: The reviewer is willing to defend the evaluation, but it is likely that the reviewer did not understand parts of the paper

Response to Reviewer KRAK (1)
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 07 Dec 2023
Comment: Dear Reviewer KRAK, Thank you for your thoughtful review! We appreciate your note that we contribute an interesting interaction simulation framework. Below we answer your questions: 1.
Detailed explanations for Figure 1. Thanks for your suggestion regarding adding more details to Figure 1. To offer further clarification on the preference aggregation and propagation process within the optimization phase, we have created a complementary figure. Please kindly refer to https://anonymous.4open.science/r/AgentCF-WWW/figure/illustration.png. Below, we provide a detailed explanation of our approach, to address any potential lack of clarity regarding the key contributions in the original paper. The first step toward aligning the simulated agents with real-world individuals is to evaluate their ability to exhibit human-like behaviors and preferences. To do this, we prompt these agents to conduct autonomous interactions, asking simulated user agents to select preferred items from candidates. We present both a positive item agent and a randomly sampled negative item agent for user agents to interact with. When agents make decisions inconsistent with real-world interaction records, we prompt them to conduct collaborative reflection and revise the misaligned concepts in their memory. Specifically, the user agent is prompted to align its simulated preferences with the positive item's characteristics (as illustrated in Figure 1, following this reflection, the updated user agent memory contains both the original tastes and newfound preferences toward the positive item). The same process is also applied to the item agents. During this process, the simulated preferences of user and item agents are exchanged and aggregated, enabling them to make decisions consistent with real-world records. The updated item agent, which has incorporated preference information from previously interacted user agents, will further interact with other user agents and aggregate preferences, facilitating the exchange of preference information among user agents. This enhances the propagation of preference information, as in traditional collaborative filtering.
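The interaction-then-reflection loop described above can be sketched in a few lines of Python. This is a minimal illustration only: `llm_choose` and `llm_reflect` are hypothetical deterministic stand-ins for the actual LLM prompts used in the paper, and the dict-based agent representation is an assumption for the sketch.

```python
# Minimal sketch of one AgentCF optimization step (illustrative, not the
# paper's implementation). The two llm_* functions stand in for LLM prompts.
import random

def llm_choose(user_memory, candidates):
    # Stub: pick the candidate whose description shares the most words with
    # the user's memory (a deterministic stand-in for prompting an LLM).
    def overlap(item):
        return len(set(user_memory.lower().split())
                   & set(item["memory"].lower().split()))
    return max(candidates, key=overlap)

def llm_reflect(old_memory, counterpart_memory):
    # Stub: merge in the counterpart's description as "newfound preferences";
    # the real system asks the LLM to rewrite the memory coherently.
    return old_memory + " Newfound preference: " + counterpart_memory

def optimization_step(user, pos_item, neg_item):
    """Autonomous interaction, then collaborative reflection on a mismatch."""
    candidates = [pos_item, neg_item]
    random.shuffle(candidates)  # randomize candidate order in the "prompt"
    chosen = llm_choose(user["memory"], candidates)
    if chosen is not pos_item:  # decision contradicts the real-world record
        # Both sides revise their memories against the ground-truth interaction,
        # so preferences are exchanged and aggregated on both ends.
        user["memory"] = llm_reflect(user["memory"], pos_item["memory"])
        pos_item["memory"] = llm_reflect(pos_item["memory"], user["memory"])
    return chosen is pos_item
```

After a mismatch, the positive item's characteristics end up in the user agent's memory (and vice versa), which is the aggregation step; when the updated item later meets other user agents, that information propagates further.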
As you can see, both aggregation and propagation effects rely heavily on the interaction between the user and item agents. If there were no item agents, a user agent would mainly learn preferences from the one-hop items that he/she has interacted with, since the items could not aggregate information from related users. Introducing item agents is key for recommender systems, as it enables preference aggregation and propagation through agent interactions, thereby emulating the idea of collaborative filtering. We believe this is a very promising direction and would like to emphasize its importance in extending existing user-agent-based simulation. We agree with the reviewer's comment about Figure 1 and its explanation, and will make appropriate revisions on this point.

Response to Reviewer KRAK (2)
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 07 Dec 2023
Comment: 2. Explicit specification of experiments. Thanks for the suggestion. Our experiments are designed to answer the following questions:
Q1: How do our optimized LLM-based agents exhibit personalized interaction behaviors to achieve high-quality recommendations?
Q2: How does each component contribute to the overall performance? In particular, what is the impact of collaborative optimization on recommendation performance?
Q3: Do the limitations or weaknesses of LLMs (such as position bias and popularity bias) affect the performance of our approach, considering that our work is based on LLMs?
Q4: Can our approach be extended to support additional types of interactions?
Q5: How can we intuitively grasp the preference propagation effect of our approach, which is the most significant contribution of our work?
Next, we present the correspondence between our experiments and the above five questions.
[Q1: Main experiments, Section 3.2, Table 2] We compare the proposed AgentCF with several recommendation models and show its effectiveness (outperforming LLM-based recommenders and traditional recommenders trained on sampled datasets) in making recommendations.
[Q2: Ablation study, Section 3.3.1, Table 3; Effectiveness of collaborative optimization, Section 3.3.2, Figure 2] We ablate each component of AgentCF (autonomous interaction, user agent, and item agent), and show that all modules of AgentCF contribute to the overall performance improvement. We also evaluate whether the simulated agents indeed undergo continuous refinement throughout the collaborative optimization process, and confirm its effectiveness.
[Q3: Bias study, Section 3.3.3, Figure 3] We evaluate the effectiveness of the optimized agents in the face of the position bias and popularity bias of LLMs, and show that our simulated agents exhibit more personalized behaviors despite these biases.
[Q4: User-user interaction simulation, Section 3.4.1, Figure 4; Item-item interaction simulation, Section 3.4.2, Figure 5] We evaluate whether our simulated user agents can interact with each other in a manner resembling human social interactions. Furthermore, we explore the potential of item-item interaction in alleviating the item cold-start problem. We find that item-item interaction enables cold-start items to acquire personalized memories from popular items.
[Q5: Preference propagation study, Section 3.4.3, Figures 6 and 7] To validate the phenomenon of preference propagation during the collaborative optimization process, we perform a quantitative analysis and show that as user agents engage in continuous interactions, an increasing number of them exhibit similar preferences.
We will follow the reviewer's suggestion to add these discussions about the experimental design, to ease the following of our experiments. 3. Detailed explanation of baseline methods. Thanks for the comment.
We will revise the presentation by adding more discussion of the baseline selection. In our experiments, we consider a ranking setting for our recommendation tasks, and employ several classical and LLM-based recommendation models as baseline methods. These baselines are as follows:
[BPR, collaborative filtering-based model] Our approach implicitly models the collaborative filtering idea, where we replace the gradient optimization of traditional recommenders with language feedback. Therefore, we take BPR, one of the classical CF models, as a baseline.
[SASRec, sequential model] Our approach considers sequential factors in both the optimization and inference processes: we equip each user agent with short- and long-term memory, and enable agents to mimic real-world interactions in sequential order. Therefore, we take SASRec, one of the most widely used and powerful sequential models, as a baseline.
[Pop, commonsense-based model] Our model simulates LLM-powered user and item agents. Typically, LLMs can rely on commonsense knowledge such as item popularity to make recommendations. Therefore, we take Pop as a baseline.
[BM25, semantic-based model] LLMs also demonstrate impressive semantic encoding capacity. Therefore, we take BM25 as a baseline, which considers the textual similarity between user historical interactions and candidates to make recommendations.
[LLMRank, LLM-based model] LLMRank directly prompts LLMs to provide recommendations based on user historical interactions. In contrast, we introduce a collaborative learning approach that optimizes both user and item agents to capture their two-sided relations. To verify the effectiveness of collaborative optimization, we take LLMRank as a baseline.
Overall, Pop, BPR, and SASRec are widely used baselines in the recommender systems literature. Furthermore, our work involves textual item information, so it is natural to consider search methods like the classic BM25 as baselines.
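To make the BM25 baseline concrete, here is a minimal, self-contained sketch of scoring candidate items by textual similarity to a user's interaction history. The scoring function is a standard Okapi BM25 formula with common default parameters (k1=1.5, b=0.75); the toy item texts are invented for illustration and are not from the paper's datasets.

```python
# Minimal Okapi BM25 sketch: rank candidate item texts against a "query"
# built from the user's historical interactions (illustrative only).
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical user history and candidate item descriptions.
history = "classic rock cd hard rock cd".split()
candidates = [
    "a classic rock compilation".split(),
    "a smooth jazz album".split(),
]
scores = bm25_scores(history, candidates)
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

Here the rock compilation outranks the jazz album because it shares terms with the user's history, which is exactly the signal this baseline exploits.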
We also compare our work with recent LLM-based recommendation methods, which further complement our baselines.

Response to Reviewer KRAK (3)
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 07 Dec 2023
Comment: 4. The reason why we don't report the results of all the prompting strategies in the ablation study. Thanks for the comment! We agree that it is important to include the results of all the prompting strategies in the ablation study. However, due to the space limitation, we only report the results of AgentCF_B, though we have indeed conducted experiments involving the two remaining prompting strategies. We provide these results below and will add them to the appendix. As you can see, the overall performance trend is similar across all the prompting strategies.

| Variants | N@1 | N@10 | N@1 | N@10 | N@1 | N@10 |
|---|---|---|---|---|---|---|
| Vanilla | 0.2067 | 0.5328 | 0.2333 | 0.5405 | 0.2100 | 0.5198 |
| Auto. Interaction | 0.1200 | 0.4964 | 0.1400 | 0.5042 | 0.1367 | 0.4994 |
| User Agent | 0.1100 | 0.4693 | - | - | - | - |
| Item Agent | 0.1767 | 0.5128 | 0.1933 | 0.5233 | 0.1900 | 0.4956 |

| Variants | N@1 | N@10 | N@1 | N@10 | N@1 | N@10 |
|---|---|---|---|---|---|---|
| Vanilla | 0.2067 | 0.5335 | 0.1933 | 0.5247 | 0.1600 | 0.5147 |
| Auto. Interaction | 0.1733 | 0.5031 | 0.1800 | 0.5113 | 0.1533 | 0.4890 |
| User Agent | 0.2200 | 0.5145 | - | - | - | - |
| Item Agent | 0.1800 | 0.5169 | 0.1667 | 0.5135 | 0.1933 | 0.5145 |

Notably, there is some overlap between the settings of the different ablation variants and the prompting strategies. When we ablate the user agents, we represent each user with their historical interactions. In this case, including AgentCF_{B+R} (retrieving specialized preferences from long-term memory) and AgentCF_{B+H} (adding user historical interaction records) becomes meaningless. 5. Discussion of existing work on user behavior simulation. Thanks for your valuable suggestion; we will add a dedicated part on user behavior simulation to the related work, as the reviewer suggested.
We present the drafted content for user behavior simulation below. User behavior simulation has garnered much attention in the research and industry communities [1,2]. It helps researchers gain insights into user behavior patterns and improve user-friendly services. In the realm of recommender systems, numerous efforts [3,4,5] have been undertaken to simulate user behaviors, involving the design of complex rules, supervised learning, or the employment of reinforcement learning paradigms. However, simulating believable user behavior is challenging due to the personalized nature of behaviors. Recent advancements in LLMs have prompted researchers to simulate behaviors based on user interactions [6,7]. However, we argue that LLMs may struggle to capture users' underlying preferences, as they lack the understanding of domain-specific user behaviors and item catalogs. To address this, we propose AgentCF, an approach for simulating user-item interactions using agent-based collaborative filtering.
[1] Stavinova et al. "Synthetic data-based simulators for recommender systems: A survey." arXiv:2206.11338 (2022).
[2] Bernardi et al. "Simulations in recommender systems: An industry perspective." arXiv:2109.06723 (2021).
[3] Kiyohara et al. "Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation." arXiv:2109.08331 (2021).
[4] Huang et al. "Keeping dataset biases out of the simulation: A debiased simulator for reinforcement learning based recommender systems." RecSys 2020.
[5] Ie et al. "RecSim: A configurable simulation platform for recommender systems." arXiv:1909.04847 (2019).
[6] Huang et al. "Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations." arXiv:2308.16505 (2023).
[7] Zhang et al. "On Generative Agents in Recommendation." arXiv:2310.10108 (2023).
Thank you! We sincerely appreciate your thoughtful questions and valuable advice.
Please feel free to reach out if you have any additional queries. If you believe that our response has effectively addressed your concerns, we kindly ask if you would consider raising your rating score for our paper. Thank you for taking the time to consider our request!

Gentle Reminder
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 11 Dec 2023
Comment: Dear Reviewer, We hope you are doing well. We want to reach out to see if you have any further questions. If you do, we would appreciate the opportunity to respond before the discussion period ends. Thank you once again for your thoughtful review and help in improving the paper. We appreciate your time and consideration. Best regards, Authors of AgentCF

Kindly Reminder
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 13 Dec 2023
Comment: Dear Reviewer, We hope you are doing well. Since the discussion period will come to an end tomorrow, December 14th, could you kindly share with us any additional questions or concerns? We would be delighted to carry on the conversation. We extend our sincere gratitude for your insightful review and your valuable assistance in enhancing the paper. Best regards, Authors of AgentCF

Replying to Response to Reviewer KRAK (3)
Response to Authors
Official Comment by Reviewer KRAK, 14 Dec 2023
Comment: Thanks to the authors for the detailed rebuttal. I have read it and updated my score.
Thanks for your response and support
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 14 Dec 2023
Comment: Thanks for your valuable support and for raising the rating. We genuinely appreciate your suggestions, which help us improve the work.

Official Review of Submission1299 by Reviewer Fagi
Official Review by Reviewer Fagi, 23 Nov 2023 (modified: 13 Dec 2023)
Review: Summary: AgentCF proposes a novel approach for simulating user-item interactions in recommender systems through agent-based collaborative filtering. The paper creatively considers both users and items as agents, developing a collaborative learning approach that optimizes these agents together. The proposed methodology involves prompting user and item agents to interact autonomously, followed by collaborative reflection to adjust any misleading simulations based on real-world interaction records. Extensive experiments validate the effectiveness of the proposed approach. It demonstrates that AgentCF achieves superior or comparable performance to traditional recommendation models trained on similarly scaled datasets. Notably, the model performs effectively in sparse data scenarios, training on only approximately 0.07% of the complete dataset. Pros: The paper's approach of treating both users and items as agents in a recommender system is highly original. This dual-agent perspective is a significant departure from traditional recommender systems, which typically focus on user preferences without equally emphasizing the role of items as interactive agents. Incorporating memory modules for both user and item agents is an inventive aspect of the paper. This feature allows for the integration of collaborative signals into the user-item interactions.
Utilizing language to explicitly model and refine CF signals is a relatively unexplored concept in the field of recommender systems. The introduction of a collaborative reflection mechanism as a training process for mutual memory updates between user and item agents is also a novel element. This process allows both user and item agents to adjust their behavior based on discrepancies between simulated and real-world interactions, adding a layer of sophistication to the model. Experiments validate that AgentCF is effective in scenarios with little available data, especially on sparse datasets. The paper is well organized, clearly written, and easy to understand. Cons: The contribution is limited by the heavy sampling of the datasets (only 1% of the users in the original dataset are selected). On such a small dataset, the experimental results compared to traditional baselines are unconvincing. Continuously updating and reflecting on the memory during simulation leads to an overwhelming amount of textual information in the memories of both users and items. It is also highly likely that catastrophic forgetting occurs, fitting recent interactions without consideration of early interactions. A well-designed mechanism should be applied to effectively utilize the information contained in the memory. The paper could delve deeper into the interpretability of the model's decisions and recommendations. Given the complexity of the model's collaborative reflection and autonomous interactions, providing insights into why certain recommendations are made could enhance its applicability and trustworthiness. Questions: AgentCF demonstrates commendable performance in situations where training data is sparse. Its success is largely due to leveraging a Language Model (LLM) pre-trained on extensive corpora, thus efficiently utilizing semantic information in few-shot scenarios. However, there is a concern regarding its performance as data increases.
Given that LLMs, despite their robustness, might struggle with longer prompts, the rising complexity and length of inputs could adversely affect the LLM's reasoning capabilities and context maintenance. This issue could lead to reduced performance gains as the data volume grows, potentially limiting the overall contribution of this work. It would be beneficial if the authors could address this challenge, potentially improving the system's adaptability to diverse data volumes and further solidifying the work's contribution.
Ethics Review Flag: No
Ethics Review Description: No
Scope: 4: The work is relevant to the Web and to the track, and is of broad interest to the community
Novelty: 6
Technical Quality: 6
Reviewer Confidence: 4: The reviewer is certain that the evaluation is correct and very familiar with the relevant literature

Response to Reviewer Fagi (1)
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 07 Dec 2023
Comment: Dear Reviewer Fagi, Thank you for engaging with our work and for noting the inventiveness of our methodology and experimental setup. We would like to address your concerns in detail below. 1. Clarification about the dataset sampling. Thanks for the insightful comment. In response to this issue, we clarify the experimental settings for dataset sampling below. Notably, we provide versions of BPR and SASRec that have been trained on both the sampled and the full training data. The full-data trained versions are denoted as BPR_{full} and SASRec_{full}. Please refer to the comparison results in Table 2. We believe that the incorporation of full-data trained BPR and SASRec leads to a fairer comparison. As observed, our model can achieve comparable or even superior performance to BPR_{full} and SASRec_{full} in sparse scenarios.
Although our model performs slightly worse than traditional models in dense scenarios, it is worth noting that our model is trained using only around 1% of the overall training data used for BPR_{full} and SASRec_{full}. These results also demonstrate the generalization capability of our approach. The scaling of agents remains an unsolved challenge, due to the low efficiency (1.5 hours for conducting collaborative reflection over 100 users and their interacted items in our work) and high cost ($115 for completing the experiments listed in Table 2) of agent communication. Although our current experiments are limited to several hundred agents per dataset, the approach could be extended to a larger scale once techniques for scaling agents are developed; e.g., small models can be fine-tuned to reduce the cost of scaling agents [3,4,5]. It should also be noted that the number of agents in our work is larger than in several existing studies on LLM-based simulation (e.g., 25 in [1] and 7 in [2]). In future work, we will investigate more lightweight LLMs, such as LLaMA, to expand the scale of simulated agents. Our proposed AgentCF is a small step toward simulating interaction behaviors using LLM-powered agents, and it sheds light on possibilities for improved user modeling, item recommendation, resource allocation, and new strategy tests in recommender systems. It also provides a new paradigm, language feedback-based agent simulation, for understanding traditional collaborative filtering recommenders that parameterize user and item representations. We believe this approach is highly valuable to explore, while acknowledging the reviewer's point that there is room for improvement in various settings and technical methods.
[1] Park et al. "Generative agents: Interactive simulacra of human behavior." UIST 2023.
[2] Xu et al. "Language agents with reinforcement learning for strategic play in the werewolf game." arXiv:2310.18940 (2023).
[3] Chen et al. "FireAct: Toward language agent fine-tuning." arXiv:2310.05915 (2023).
[4] Zeng et al. "AgentTuning: Enabling generalized agent abilities for LLMs." arXiv:2310.12823 (2023).
[5] Yin et al. "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs." arXiv:2311.05657 (2023).

Response to Reviewer Fagi (2)
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 07 Dec 2023
Comment: 2. Clarification about the memory update. Thanks for your suggestions. It is indeed an important issue to ensure that all the historical data can be utilized during the agent simulation. In our approach, we have designed two strategies to alleviate this problem. (1) Regarding the memory mechanism, we equip each user agent with both short-term memory (a text recording recent interactions) and long-term memory (a pool storing past experience). At each update step, the previous short-term memory is appended to the long-term memory, and the short-term memory is rewritten to incorporate the newfound preferences using LLMs. It is also important to note that, as illustrated in Table 2, incorporating relevant information from long-term memory (AgentCF_{B+R}) yields superior results compared to solely relying on the user's short-term memory to prompt LLMs (AgentCF_B). This verifies the effectiveness of our proposed memory mechanism. (2) Regarding prompt design, we refine the prompting strategies for the short-term memory update. This includes: summarizing past preferences from the un-updated short-term memory; exploring new preferences based on the new interaction; and merging the past and new information while removing any conflicting or repetitive parts.
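The two-level memory update just described can be sketched as a small data structure. This is a hedged illustration only: the class and method names are hypothetical, the `merge` callable stands in for the LLM prompt that summarizes, explores, and merges preferences, and the word-overlap retrieval is a stand-in for whatever similarity measure AgentCF_{B+R} actually uses.

```python
# Minimal sketch of a user agent's short-/long-term memory (illustrative).
class UserAgentMemory:
    def __init__(self, initial_tastes):
        self.short_term = initial_tastes  # one text of recent preferences
        self.long_term = []               # pool of archived short-term snapshots

    def update(self, new_interaction, merge):
        # Archive the current short-term memory into the long-term pool...
        self.long_term.append(self.short_term)
        # ...then rewrite the short-term memory by merging old and new
        # preferences (the real system does this with an LLM prompt).
        self.short_term = merge(self.short_term, new_interaction)

    def retrieve(self, query, k=1):
        # AgentCF_{B+R}-style step: pull the k archived memories most relevant
        # to the current candidates, here ranked by naive word overlap.
        q = set(query.lower().split())
        ranked = sorted(self.long_term,
                        key=lambda m: len(q & set(m.lower().split())),
                        reverse=True)
        return ranked[:k]
```

A usage example with a trivial concatenating `merge`: updating with a new interaction archives the old tastes, and retrieval later surfaces them again, which is the mechanism the authors credit with limiting catastrophic forgetting and prompt length.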
We would like to showcase an example of how the short-term memory of a user agent can be updated: Previous memory: "After listening to the CD 'High 'n' Dry', I realized that I have a newfound preference for CDs that fall under the categories of 'Alternative Rock' and 'New Wave & Post-Punk'. These genres provide a unique and captivating listening experience that resonates with my current musical taste." Interacting with a CD: "The CD is called 'Masterpeace'. The category of this CD is: 'Metal; Pop Metal'." Updated memory: "I now have a newfound preference for CDs that fall under the category of 'Pop Metal', as exemplified by the CD 'Masterpeace'. This genre offers a unique blend of metal and pop elements that I find enjoyable and engaging. Additionally, I had the enjoyment of CDs categorized as 'Alternative Rock' and 'New Wave & Post-Punk'. These genres continue to resonate with my musical taste and offer a captivating listening experience." As we can see, our designed prompt strategy enables the agents to update their memories by combining past and new preferences, thereby alleviating the catastrophic forgetting problem. In the revised version of this paper, we will follow the reviewer's comment and include more discussion of the catastrophic forgetting problem and our strategies for mitigating it. That said, this remains a challenging problem for various machine learning tasks, and we intend to explore additional methods to alleviate it in our future work. 3. Interpretability of AgentCF Thanks for the insightful comment. Unlike traditional methods, LLM-based agents are developed using natural language (e.g., memory, feedback, and action), which naturally enhances the interpretability of the model's decisions and recommendations. Next, we detail this point in two major aspects. (1) We have incorporated memory modules into both user agents and item agents.
These modules store simulated user preferences and potential adopters of items in natural language, which naturally provides an interpretable way to understand how our approach works. Next, we list two examples to illustrate this. An optimized user agent memory that demonstrates simulated preferences: "I have a fondness for CDs in the Classic Rock and AOR genres, exemplified by my enjoyment of CDs like 'Desperado' and 'Back in Black'. I appreciate the raw energy that is often associated with these genres." An optimized item agent memory that showcases its intrinsic features and tastes of potential adopters: "With thought-provoking lyrics and exceptional musical talent, it offers a unique and eclectic listening experience that showcases the creativity of the artist. This CD aligns with the preferences of fans who enjoy CDs that offer diverse melodies, showcasing a love for progressive rock, and world music." (2) Furthermore, we can also design suitable prompting strategies to directly generate the explanation for a recommendation based on the user agent memory and item agent memory. For example, we find that the user agent mentioned above chose to interact with the above item, providing the following explanation: "This CD aligns perfectly with my preferences for Progressive Rock genres. It offers diverse melodies, showcasing a love for world music. The thought-provoking lyrics and exceptional musical talent make it an eclectic listening experience, which resonates with my appreciation for raw energy associated with these genres." We will follow the reviewer's suggestion to add more discussions about the interpretability of our approach. Response to Reviewer Fagi (3) Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)07 Dec 2023, 06:45 (modified: 07 Dec 2023, 08:28)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, AuthorsRevisions Comment: 4. 
Discussion about the limited context length of LLMs This is an interesting and challenging question. To address this problem, we employ the following strategies in our approach: (1) Retrieving relevant information from memory. During inference, the user agent can retrieve specific details from its long-term memory, rather than simply concatenating all memories into one prompt. This enables agents to provide more specialized and informative responses. Comparing the results of these two methods, we find that directly inputting all memories into LLMs performs worse than prompting LLMs to model specific details using the retrieval paradigm. This validates the effectiveness of our approach.

Variants                          N@1     N@5     N@10
Retrieval from long-term memory   0.2333  0.4142  0.5405
All memories in one prompt        0.1933  0.3938  0.5126

(2) Using LLMs with long context lengths. Recently, LLMs have been extending their maximum context length. For example, LLaMA 2 supports a maximum input length of 4096 tokens, and Claude 2.1 even extends this limit to 200k tokens. In our experiments, we develop agents using gpt-3.5-turbo-16k, which supports an input length of 16k tokens. This enables us to conduct all the experiments presented in this paper. Nevertheless, efficiently extending the input length of LLMs remains an unresolved challenge. We will continue to follow the development of long-sequence modeling to improve the adaptability and robustness of our approach. In addition, we will consider designing more robust strategies to mitigate potential issues of LLM prompting, which will be discussed and added to future work in our revised version. Thank you! We sincerely appreciate your insightful questions and valuable advice. If you have any more questions, we're more than willing to continue the discussion. If you find that our response addresses your concerns, could you please consider increasing your rating score for our paper? Your consideration is highly appreciated.
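P.S. For concreteness, the retrieval paradigm in strategy (1) could be sketched as below. This is a dependency-free sketch: word overlap stands in for whatever semantic similarity measure (e.g., an embedding model) a real system would use, and the function name is illustrative:

```python
def retrieve(long_term_memory, query, k=2):
    """Return the k memory entries most similar to the query.

    Similarity here is naive word overlap, a stand-in for embedding-based
    semantic similarity.
    """
    query_words = set(query.lower().split())

    def overlap(entry):
        return len(query_words & set(entry.lower().split()))

    # Rank the memory pool by overlap with the query; keep the top k.
    return sorted(long_term_memory, key=overlap, reverse=True)[:k]
```

Only the retrieved entries are placed in the prompt, keeping it well within the model's context window even as the long-term pool grows.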
Gentle Reminder Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)11 Dec 2023, 06:43Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Dear Reviewer, We hope you are doing well. We want to reach out to see if you have any further questions. If you do, we would appreciate the opportunity to respond before the discussion period ends. Thank you once again for your thoughtful review and help in improving the paper. We appreciate your time and consideration. Best regards, Authors of AgentCF Kindly Reminder Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)13 Dec 2023, 18:30Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Dear Reviewer, We hope you are doing well. Since the discussion period will come to an end tomorrow, December 14th, could you kindly share with us if you have any additional questions or concerns? We would be delighted to carry on with the conversation. We extend our sincere gratitude for your insightful review and your valuable assistance in enhancing the paper. Best regards, Authors of AgentCF Replying to Gentle Reminder Response to rebuttal Official CommentReviewer Fagi13 Dec 2023, 18:48Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Thanks for the detailed rebuttal from the authors. After reading the rebuttal, most of my concerns are well addressed. I would like to improve the technical score to 6. Thanks for your response and support Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)13 Dec 2023, 19:25Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: We are happy that our responses have addressed your concerns. We would like to express our sincerest gratitude once again for taking the time to review our paper and provide us with such detailed and invaluable suggestions!
Official Review of Submission1299 by Reviewer SKAF Official ReviewReviewer SKAF20 Nov 2023, 20:33 (modified: 12 Dec 2023, 22:17)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer SKAF, AuthorsRevisions Review: Pros: The paper explores a significant and timely research area by delving into LLM-based recommendation systems and introducing an interesting concept, the item agent. Providing an anonymous code repository adds value to the paper by enhancing reproducibility. The authors present comprehensive experiments, effectively conveying their findings through well-organized tables and graphs. Cons: Certain crucial concepts and claims lack sufficient explanation, diminishing the paper's clarity. Although Figure 1 illustrates the workflow of the AgentCF framework, its complexity may confuse readers. The figure doesn't effectively highlight the pivotal role of item agents, and the inclusion of extensive text within the figure is discouraged as it diminishes the advantages typically associated with visual representations. Questions: The authors mention that "item information is relatively stable through time". Based on this, what is the necessity of the item agent than the description or item title? "We find that LLMs tend to over-complain the negative item agent's drawbacks, disregarding the fact that it may also be attractive for other users." How do the authors get this conclusion? Are there any citations or experiments supporting this claim? What do , and stand for? Why does not need long-term memory in Eq. (8)? use more information ( ) in Eq. (7) than in Eq. (6). Why performs worse than in some cases? 
Ethics Review Flag: No Ethics Review Description: N/A Scope: 4: The work is relevant to the Web and to the track, and is of broad interest to the community Novelty: 5 Technical Quality: 5 Reviewer Confidence: 3: The reviewer is confident but not certain that the evaluation is correct Response to Reviewer SKAF (1) Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)07 Dec 2023, 07:00Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Dear Reviewer SKAF, We thank the reviewer for engaging with our work, and recognizing the significance of our proposed item agent. Below we answer your questions: 1. Clarifications about crucial concepts and claims Sorry for the unclear explanation of our proposed approach. We will make revisions to the presentation by incorporating additional explanations regarding the crucial concepts and claims. To respond to this issue, we clarify the proposed concepts and claims below. Key concepts: We explain three key concepts as follows: User and item agent. In contrast to existing user-oriented simulation approaches, we employ LLM-powered agents to simulate both users and items in this paper. By doing this, we aim to simulate user-item interaction through agents' autonomous interaction and capture user-item relations via collaborative reflection. Autonomous interaction. We encourage user agents to select a preferred item from a contrastive pair consisting of a positive item and a randomly sampled item. This enables us to evaluate the behavior alignment between the simulated agents and real-world individuals. Collaborative reflection. If the simulated agent's decisions are inconsistent with real-world interaction records, it indicates that the simulated agents have not yet aligned with real individuals. In this case, we prompt the agents to reflect on the misleading concepts in their simulated memory, by aligning user agent memory with the pos item agent memory collaboratively. 
This enables agents to capture the user-item interaction relations, and further align their decision-making process with those of real individuals. Key claims: A major contribution is that our simulated agents can implicitly capture the collaborative filtering idea by conducting collaborative optimization. As mentioned above, the memories of user agents and positive item agents progressively exchange and aggregate through collaborative reflection. Subsequently, these updated item agents, enriched with preference information of previously interacted user agents, can further interact with other user agents, thereby facilitating the propagation of preferences from different users. This enables user agents with similar interaction behaviors to develop similar preferences, which lies at the core of collaborative filtering. We will follow the reviewer's suggestion to add more detailed explanations about the crucial concepts and claims. 2. Highlight the role of item agents and clarify details in Figure 1 Thanks for your suggestion regarding emphasizing the role of item agents. To provide more clarity regarding the importance of item agents, as well as the preference aggregation and propagation process during the optimization phase, we create a complementary figure. Please kindly refer to https://anonymous.4open.science/r/AgentCF-WWW/figure/illustration.png Next, we clarify the role of item agents in our approach. Generally speaking, the introduction of item agents allows for modeling the relationships between users and items in this two-sided interaction process. As mentioned before, during collaborative optimization, both kinds of user and item agents can interact and mutually align their simulated preferences. This can be observed in the highlighted content of Figure 1, where the simulated preferences of user and item agents are exchanged and aggregated. 
These updated agents then proceed to interact with other agents, repeating the same process of memory exchange and update. This enables preference propagation and captures the fundamental concept of "like attracts like" in collaborative filtering, where users tend to prefer items adopted by similar users. It is evident that this process heavily relies on both user and item agents, while existing work often neglects item-side modeling. Without item agents, a user agent mainly learns preferences from the one-hop items it has interacted with, as the items cannot aggregate information from related users. The role of item agents can thus be understood as enhancing the multi-hop relatedness in the user-item interaction graph (as illustrated in the bottom left corner of Figure 1). Overall, we believe in the potential of including item agents in interaction simulations, which mimics the idea of collaborative filtering via collaborative optimization. We appreciate the reviewer's suggestion. We will incorporate this feedback and update Figure 1 in the next version, by eliminating extensive text. Response to Reviewer SKAF (2) Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)07 Dec 2023, 07:19 (modified: 07 Dec 2023, 08:37)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, AuthorsRevisions Comment: 3. Clarification about the evolution of item agents We are sorry that this statement was not well presented in our writing. We intended to convey that the evolution of user agent preferences is more dynamic compared to that of item agents. This idea is supported by existing research on user multi-interest modeling [1,2,3], which typically represents a single user with multiple vectors encoding various facets of their interests, while an item is typically represented by a single ID embedding.
It should be noted that even though item agents evolve relatively steadily, they play a crucial role in modeling user-item relations and the collaborative filtering idea. Conversely, solely relying on description text for item representation limits the scope of preference learning. Without item agents, user agents are confined to considering only the one-hop items they've directly interacted with, leading to suboptimal user preference modeling. Additionally, the ablation study presented in Table 3 demonstrates that directly representing items with their corresponding description text leads to inferior performance. To facilitate easier referencing for the reviewer, we present the results below. This finding highlights the effectiveness of behavior-involved item-side modeling.

Variants                        N@1     N@10
AgentCF (with item agents)      0.2067  0.5328
Items as description text only  0.1767  0.5128

[1] Li et al. "Multi-interest network with dynamic routing for recommendation at Tmall." CIKM. 2019. [2] Xiao et al. "Deep multi-interest network for click-through rate prediction." CIKM. 2020. [3] Cen et al. "Controllable multi-interest framework for recommendation." SIGKDD. 2020. 4. Clarification about the update of negative item agents Thanks for the insightful comment. To respond to this query, we further discuss the effect of negative samples below. We already explored the effect of updating negative item agents on collaborative optimization when designing the model. Our finding is that updating negative items can reduce the model performance. To facilitate easy reference for the reviewer, we provide these results below. As you can see, the model's recommendation performance declines significantly when negative item agents are updated.

Variants                          N@1     N@10
AgentCF (negative items fixed)    0.2067  0.5328
AgentCF (negative items updated)  0.1700  0.5104

We delve into the underlying reasons for this result below.
We first examine an example of target item memory in a failed recommendation scenario, where the item was updated as a negative item during the training process. As we can see, despite incorporating universal characteristics, the updated information of the negative item retains an excessively negative tone. The updated memory: "Don't Fear The Reaper: The Best Of Blue Öyster Cult 4 offers a collection of classic rock and album-oriented sound, which may appeal to those who enjoy a blend of hard rock and psychedelic elements. However, it may not resonate with individuals seeking a guitar-centric sound that showcases technical prowess and intricate guitar solos. The focus of this CD is not on rock guitarists and their virtuosity, which may disappoint those who appreciate a diverse musical experience." Then, to evaluate whether LLMs tend to "suppress" candidates with negative descriptions, we conduct a deeper analysis experiment. We equip all the target items with such negative information and evaluate the recommendation performance. The results are as follows. We observe that LLMs do indeed suppress the negative items, possibly due to over-alignment of LLMs. We make a similar observation in Section 3.4.1.

Variants                                N@1     N@10
AgentCF (original target item memory)   0.2067  0.5328
Target items with negative information  0.1433  0.4523

In conclusion, we do not update negative item agents during the collaborative optimization process. We will add these additional results in the appendix, to make our model design easier to follow. Response to Reviewer SKAF (3) Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)07 Dec 2023, 07:37Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: 5. Discussion about the prompting strategies We propose three prompting strategies for agents to infer potential user-item interactions. (1) R_B is the basic prompting strategy. It incorporates the preferences of user agents from their short-term memories, as well as the characteristics of candidate items from their memories.
(2) Building upon R_B, R_{B+R} further integrates the past experience of user agents. It retrieves specialized preferences from long-term memory by using the characteristics of candidates as queries. This allows for more specialized responses compared to directly using all long-term memories as prompts. (3) R_{B+H} combines the short-term memory of user agents with their historical interactions. This enables LLMs to serve as sequential recommenders, overcoming challenges posed by sparse interaction records and limited preference propagation. Taking user historical interactions as prompts is also a commonly used strategy. Based on the above discussion, we next explain why we don't incorporate R_{B+H} with long-term memory. It is important to note that both the R_{B+R} and R_{B+H} strategies aim to incorporate long-term behavior patterns of user agents to enhance the prompt, so they share some historical information. This can lead to a waste of computational resources when combined. Moreover, combining them may result in excessive prompt text, which can harm the performance of LLMs. We compare the recommendation performance of combining both strategies and using them separately as follows. The results demonstrate that incorporating all this information together (R_{B+R+H}) does not yield significant improvement. Therefore, for reasons of efficiency and performance, we do not introduce this prompting strategy.

Variants   N@1     N@5     N@10
R_B        0.2067  0.4078  0.5328
R_{B+R}    0.2333  0.4142  0.5405
R_{B+H}    0.2100  0.4167  0.5198
R_{B+R+H}  0.1933  0.4187  0.5287

6. Analysis of the performance difference between R_{B+R} and R_B In our experiments, we find that, with the exception of the Office_{dense} dataset, R_{B+R} consistently outperforms R_B in all other scenarios. We speculate that this could be because the target items in the Office_{dense} dataset might have stronger connections with recent user interactions, making R_B (which focuses on short-term memory) more effective.
We conduct experiments to verify whether Office_{dense} exhibits such a "recency-focused" pattern. Specifically, we evaluate the results of SASRec when modeling truncated user historical interaction sequences. The results are presented in the table below. We find that by focusing solely on recent interactions, SASRec's performance improves. This confirms the special data pattern of the Office_{dense} dataset, which enables the agent to better infer target items by leveraging short-term memory to simulate user preferences. On the other hand, retrieving long-term experiences could potentially affect the LLM's attention distribution and inferences, leading to poorer recommendation results.

Variants                            N@1     N@5     N@10
SASRec_{all interactions}           0.4700  0.6226  0.6959
SASRec_{recent four interactions}   0.4700  0.6327  0.6974
SASRec_{recent three interactions}  0.4800  0.6420  0.7059
SASRec_{recent two interactions}    0.4900  0.6410  0.7090
SASRec_{recent one interaction}     0.4800  0.6324  0.7000

Moreover, we recognize the need to refine the retrieval mechanism of R_{B+R}. In our approach, we employ the characteristics of candidates as queries to retrieve specialized information from long-term memory. However, this method may occasionally retrieve information relevant to negative candidates rather than positive items, which can lead to suboptimal recommendations. In our future work, we intend to explore more advanced methods to enhance the retrieval results and improve the recommendations. Thank you! We deeply value your exceptional questions and insightful suggestions. Please feel free to reach out if you have additional questions. If you find that our response addresses your concerns, would you kindly consider raising your rating score for our paper? We greatly appreciate your consideration.
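P.S. The recency probe above amounts to truncating each user's history before feeding it to the sequential recommender. A minimal sketch (the function name and data layout are illustrative, not from our codebase):

```python
def truncate_histories(histories, n):
    """Keep only each user's n most recent interactions.

    `histories` maps a user id to a chronologically ordered list of item ids;
    slicing from the end keeps the most recent n entries.
    """
    return {user: seq[-n:] for user, seq in histories.items()}
```

Training and evaluating SASRec on, say, `truncate_histories(data, 2)` versus the full sequences is what distinguishes the rows of the table reported above.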
Gentle Reminder Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)11 Dec 2023, 06:42Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Dear Reviewer, We hope you are doing well. We want to reach out to see if you have any further questions. If you do, we would appreciate the opportunity to respond before the discussion period ends. Thank you once again for your thoughtful review and help in improving the paper. We appreciate your time and consideration. Best regards, Authors of AgentCF Replying to Gentle Reminder Thanks to the Authors Official CommentReviewer SKAF12 Dec 2023, 22:18Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Dear Authors, Thank you for the detailed comments. I have read them and updated my score. Thanks for your response and support Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)12 Dec 2023, 23:10Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors Comment: Thanks for your response and support. We are glad to know that our rebuttal has addressed your concerns. Please let us know if any concerns remain. Official Review of Submission1299 by Reviewer sVXi Official ReviewReviewer sVXi20 Nov 2023, 05:03 (modified: 01 Dec 2023, 22:27)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Reviewer sVXi, AuthorsRevisions Review: SUMMARY In this paper, the authors propose an agent-based framework that exploits Large Language Models to simulate the interaction of a user with a recommender system. In particular, the authors introduce the novel and intriguing idea of also treating the item as an agent, with its own beliefs and representation. Another interesting idea spread throughout the paper is that gradient optimization can be simulated by means of a multi-turn interaction.
The experiments confirmed the effectiveness of the intuitions and the novelty of the approach, while some of the aspects (detailed comments follow) are not completely convincing. STRONG POINTS - Novel and timely topic - Solid methodology that borrows concepts from agent-based architectures - Good experimental results WEAK POINTS - Sub-optimal choices for the experiments - Limited findings DETAILED COMMENTS The impact of sampling on the overall results is not clear. Line 209: "Different from traditional recommendation models, the LLM that implements f_LLM(·) will be fixed during the optimization process." - what do you mean by fixed? Line 412: "At each step of interaction (i.e., optimization), we iterate the selection process (Section 2.2.2) and the collaborative reflection process (Section 2.2.3)" --> do you repeat the process on the same item, or do you continue with a new one? Line 734: typo - copmarison The main concern I have with the training process is the collaborative reflection part. It should be better explained how and when the process is started. In particular, it is not clear to me at which moment of the training the information coming from other users is prompted to change the representation of the user itself and the items. I guess that the order has some importance here. Questions: Please clarify the choice of the dataset. Line 209: what do you mean by "fixed" recommendation model? Line 412: "At each step of interaction (i.e., optimization), we iterate the selection process (Section 2.2.2) and the collaborative reflection process (Section 2.2.3)" --> do you repeat the process on the same item, or do you continue with a new one? How does the system perform in a complete cold-start scenario (i.e., no ratings for a particular user)? Please better explain when and how the collaborative reflection part comes into play.
It is not clear to me at which exact moment the information coming from peers is exploited to represent the current model of the user. Ethics Review Flag: No Ethics Review Description: No ethical concerns Scope: 3: The work is somewhat relevant to the Web and to the track, and is of narrow interest to a sub-community Novelty: 6 Technical Quality: 5 Reviewer Confidence: 3: The reviewer is confident but not certain that the evaluation is correct Response to Reviewer sVXi (1) Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)07 Dec 2023, 07:53 (modified: 07 Dec 2023, 08:44)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, AuthorsRevisions Comment: Dear Reviewer sVXi, Thank you for your thoughtful review. We appreciate your note that we contribute a solid methodology that exploits LLMs to simulate user-item interactions in recommender systems. Below we answer your questions: 1. Discussion about the effect of dataset sampling and choices We appreciate your questions about the effect of dataset sampling and choices. To respond to this issue, we clarify our sampling strategies and analyze their effects below. Currently, we randomly sample subsets from the "CDs" and "Office" datasets, with each subset containing 100 users. To reduce randomness and evaluate the efficacy of our model across diverse interaction scenarios, and considering the effect of data sparsity on collaborative filtering recommenders, we sample both dense and sparse subsets based on data sparsity. Therefore, we conduct experiments using a total of four subsets. Regarding the effect of dataset sampling, to provide a fairer comparison, we have reported the performance of traditional recommendation models when trained on the complete datasets.
As illustrated, although our model conducts optimization on just approximately 1% of the training data, it achieves comparable performance to full-data trained models in several cases. This indicates the potential of our model in simulating user-item interactions. It should be noted that the low efficiency of agent communication poses a significant challenge to scaling agents. For example, in our work, the simulation process takes 1.5 hours to conduct collaborative reflection per dataset and costs $115 to complete the experiments listed in Table 2. Therefore, previous studies on LLM-based simulation typically involve a limited number of agents, ranging from a few to several dozen, such as the 25 agents in [1]. In contrast, to reduce randomness, our work simulates 100 users and their corresponding items, ultimately involving several hundred agents (we consider both users and items as agents). Nevertheless, it is essential to recognize that AgentCF is just an initial step in exploring the potential of LLM-based simulation within the realm of recommender systems. We remain committed to leveraging advanced technologies and more lightweight models to scale up the number of agents, thereby achieving more robust simulations. [1] Park et al. "Generative agents: Interactive simulacra of human behavior." UIST. 2023. 2. Clarification of the notation "fixed" In this paper, "fixed" means that we do not directly fine-tune LLMs on the interaction data. Instead, we leverage language feedback to update the agent's memory. Given the remarkable capabilities of LLMs, it is natural to employ LLMs for downstream tasks, such as recommendation and agent-based simulation, in zero/few-shot settings. We are also open to exploring advanced methods for fine-tuning LLMs, to reduce the overhead of API calls and enhance the efficiency of our approach. 3. Clarification of details in collaborative optimization We repeat the interaction and reflection process on the same item.
Below we explain why we repeat this process. The objective of collaborative optimization is to align the simulated agents with real-world interaction records, allowing them to capture the two-sided relations between users and items. To achieve this, the first key step is to assess the alignment between these agents and real individuals, thereby generating feedback to refine their memories. Therefore, we deliberately introduce popularity and position bias in candidate selection and prompt agents to conduct autonomous interaction. In this case, although collaborative reflection enables these agents to refine their memory and achieve better alignments with real individuals, LLMs may still suffer from the introduced bias and make incorrect choices. Thus, we repeat the optimization process until user agents make correct decisions or reach the maximum round of interactions, aiming to maximize the alignment between agents and real individuals. 4. Typo fixing Thanks for pointing out the spelling error in this paper. We appreciate your diligence. We will rectify this typo in the next revision. Response to Reviewer sVXi (2) Official CommentAuthors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more)07 Dec 2023, 08:03 (modified: 07 Dec 2023, 08:08)Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, AuthorsRevisions Comment: 5. Analysis on the cold-start scenario Thanks for the insightful comment. Providing recommendations in completely cold-start scenarios is indeed an interesting and challenging issue. We believe that this challenge can be alleviated by leveraging the impressive generalization capabilities of LLMs. In general, it is non-trivial for traditional ID-based recommendation models like BPR and SASRec to provide recommendations without any user interaction records. 
On the other hand, LLMs, based on their encoded universal knowledge, have the potential to make cold-start recommendations by leveraging various kinds of information, including user profiles, queries, and even contextual information [1,2]. It is important to note that the key idea of our work is to employ user interaction records to optimize the LLM-powered agents for simulation, reflection, and refinement. Therefore, when no historical records are available, the LLM-powered agents will generate generic behaviors using the universal knowledge they have acquired from supervised fine-tuning and reinforcement learning from human feedback (RLHF). In future work, we plan to incorporate more types of auxiliary information to assist LLM-based agents in simulating personalized behaviors.

[1] Cui et al. "M6-Rec: Generative pretrained language models are open-ended recommender systems." arXiv:2205.08084 (2022).
[2] Gao et al. "Chat-REC: Towards interactive and explainable LLMs-augmented recommender system." arXiv:2303.14524 (2023).

6. Clarification of the collaborative reflection process

Thanks for the comment. We will revise the presentation by adding more detailed explanations of the collaborative reflection process. To clarify how preference information is aggregated and propagated during the optimization phase, we have prepared an additional figure; please kindly refer to https://anonymous.4open.science/r/AgentCF-WWW/figure/illustration.png

Below, we introduce the four essential steps of the optimization process.

[Step 1, Initialization] We first initialize user agents with general preference information, such as "I enjoy listening to CDs", and initialize item agents with their identity text, such as titles and descriptions.

[Step 2, Autonomous Interaction] We then prompt the simulated agents to conduct autonomous interactions.
Specifically, the user agent is tasked with selecting its preferred item from a contrastive pair consisting of a positive item agent and a negative item agent.

[Step 3, Collaborative Reflection] Due to the gap between language modeling and behavior modeling, the un-optimized agents are not yet personalized and may make interaction decisions inconsistent with real-world interaction records. In this case, we prompt them to conduct collaborative reflection. Specifically, as illustrated in Figure 1, we prompt the user agent to align its simulated preferences with the characteristics of the positive item, and vice versa. In this way, the simulated preferences of the user and item agents are exchanged and aggregated.

[Step 4, Preference Propagation] If the optimized agents continue to make incorrect choices, we prompt them to repeat Steps 2 and 3. Once the optimized agents can make correct decisions according to real-world interaction records (our training data), they will interact with other agents and again conduct Steps 2 and 3. For example, an optimized item agent, which retains the preferences of previously interacted user agents, such as "The emotions evoked by the music continue to resonate with listeners", interacts with a new user agent, and they mutually aggregate their preferences. After this process, the newly interacting user agent incorporates the preferences of previously interacted user agents, such as "I prefer CDs evoking emotions that resonate over time". Consequently, this process enables preference propagation and implicitly models the collaborative filtering idea.

We will incorporate the reviewer's suggestion to include these detailed specifications of the optimization process, to enhance the comprehension of our approach. Thank you!

Thank you very much for your great questions and suggestions. Please let us know if you have any further questions, as we are happy to continue the discussion.
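To make the preference-propagation step concrete, here is a minimal, self-contained sketch (the record format, dictionary memories, and the `reflect` callable are illustrative assumptions on our part; in AgentCF the actual update is performed by prompting an LLM, as described above). Because both memories absorb text from the other side after each interaction, a later user who meets the same item inherits traces of earlier users' preferences.

```python
def propagate_preferences(records, user_mem, item_mem, reflect):
    """Walk chronological (user, item) interaction records; after each
    simulated interaction, both agents absorb text from the other side's
    memory, so preferences propagate through shared items - an implicit
    form of collaborative filtering.
    """
    for user, item in records:
        # Steps 2-3 (interaction + reflection) are abstracted into `reflect`,
        # which returns the updated textual memories for both agents.
        # Tuple assignment evaluates the right-hand side first, so both
        # updates see the pre-interaction memories.
        user_mem[user], item_mem[item] = reflect(user_mem[user], item_mem[item])
    return user_mem, item_mem
```

With a toy `reflect` that simply concatenates the two memories, running the records [("u1", "i1"), ("u2", "i1")] leaves u1's preference text inside u2's memory, carried over via the shared item i1.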
If you find that our response addresses your concerns, would you kindly consider championing the acceptance of our paper, or even raising your rating score? We greatly appreciate your consideration.

Gentle Reminder
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 11 Dec 2023, 06:41

Comment: Dear Reviewer, We hope you are doing well. We want to reach out to see if you have any further questions. If you do, we would appreciate the opportunity to respond before the discussion period ends. Thank you once again for your thoughtful review and help in improving the paper. We appreciate your time and consideration. Best regards, Authors of AgentCF

Replying to Gentle Reminder
Thanks to the Authors
Official Comment by Reviewer sVXi, 11 Dec 2023, 08:39

Comment: I have read all the comments. They have clarified my concerns.

Thanks for your response and support
Official Comment by Authors (Yupeng Hou, Wenqi Sun, Leyu Lin, Xin Zhao, +4 more), 12 Dec 2023, 23:05

Comment: We are glad we could address your concerns. We thank you again for the detailed review and insightful comments, which helped us improve the quality of the paper.