Reviews

Review 1

Relevance to SIGIR Perspectives: 5 (excellent)
Appropriateness: 5 (excellent)
Perceived significance or potential impact of the perspective contribution: 4 (good)
Quality of perspective presentation: 4 (good)
Does it have a point?: 5 (excellent)
Context: 4 (good)

Strengths
This paper studies an important research problem in the generative era: how to conduct effective evaluation for generative RS. As the paper states, the authors make two major contributions: (1) they categorize the evaluation challenges of Gen-RecSys into two groups; (2) they propose a holistic evaluation approach that includes scenario-based assessments and multi-metric checks. Overall, the paper is clearly written and well organized.

Weaknesses
After my read of the paper, I have two major concerns: (1) I think the proposed problem is very important, but the corresponding solution seems less interesting. I cannot intuitively see how scenario-based assessments and multi-metric checks differ from existing works or from a combination of existing techniques. In particular, the examples presented in Section 4 feel somewhat over-simplified, which makes them less compelling to me. (2) It is generally good to have the challenges categorized and summarized. One suggestion is to connect these challenges with the proposed solutions, i.e., to show how each challenge is addressed. As it stands, I cannot clearly see how these challenges are addressed by the proposed evaluation framework.

Detailed comments to authors
This paper discusses the evaluation of recommender systems powered by generative models. Overall, I think the problem is very important and worth researching, and I also enjoyed reading this paper. Despite these merits (see Strengths), I still have some major concerns, which are summarized in Weaknesses. In general, I cannot clearly see how the current framework addresses the categorized challenges.

Your final vote for this paper: 2 (Lean to Reject)

Review 2

Relevance to SIGIR Perspectives: 5 (excellent)
Appropriateness: 5 (excellent)
Perceived significance or potential impact of the perspective contribution: 4 (good)
Quality of perspective presentation: 5 (excellent)
Does it have a point?: 5 (excellent)
Context: 5 (excellent)

Strengths
- Well-written, well-organized perspective on an issue of growing importance
- Presents a comprehensive, forward-looking framework for thinking about the application of LLM-based evaluation to recommender systems
- Includes examples illustrating the principles given in the paper

Weaknesses
- LLMs are still advancing fast; the issues with LLM evaluation today may not be the same issues at the time of publication

Detailed comments to authors
I appreciate that the paper presents a single, unified perspective from researchers in both academia and industry. It is clear that the authors have given a lot of thought to what the future of LLM-powered RecSys might look like, and have carefully considered the questions that will arise in the evaluation of those systems. I think this paper is relevant now and will continue to be relevant and cited as LLM-powered recommender systems grow in prominence and LLM-powered evaluation grows in importance. I also really like the future directions given in the conclusion. Interpretability and multi-stakeholder objectives are important topics for which there seems to be no good solution yet.
If there is any weakness to this paper, it might be that the field is still evolving so fast that it could go in a completely different direction than the authors envision. But I don't think this is very likely. Minor comment: Sections 3 and 4 have the same title; I suspect this is not intended.

Your final vote for this paper: 5 (Accept)

Review 3

Relevance to SIGIR Perspectives: 4 (good)
Appropriateness: 4 (good)
Perceived significance or potential impact of the perspective contribution: 4 (good)
Quality of perspective presentation: 4 (good)
Does it have a point?: 4 (good)
Context: 4 (good)

Strengths
1. The motivation is strong
2. Proposal of a holistic evaluation framework that combines scenario-based evaluations (e.g., multi-turn dialogues, domain shifts) with multi-dimensional metrics
3. Thoughtful and systematic approach
4. Well written

Weaknesses
1. The connection between language-model-specific issues (such as hallucination and the lack of reference data) and recommender system evaluation challenges could be articulated more clearly.
2. The discussion of related work is somewhat limited.
3. Although the paper recognizes the importance of human-centric evaluation, it does not sufficiently address how human feedback, preferences, or interaction data could be systematically incorporated into the evaluation process.

Detailed comments to authors
This paper addresses a timely and important problem: how to evaluate generative recommender systems (Gen-RecSys) beyond conventional accuracy-based metrics. The motivation is strong, as traditional evaluation protocols often fail to capture the complexities introduced by open-ended, language-driven outputs, which limits the reliability and reproducibility of recent research in this area. A key contribution of the paper is its proposal of a holistic evaluation framework that combines scenario-based evaluations (e.g., multi-turn dialogues, domain shifts) with multi-dimensional metrics, such as relevance, factuality, fairness, and alignment with platform policies. The WH-question-based structure (What, Why, How, Which, Where, and For Whom) is a thoughtful and systematic approach, and the inclusion of multiple stakeholder perspectives makes the framework especially well suited for real-world applications. The paper is clearly written, logically structured, and grounded in both academic and industrial viewpoints, making it a valuable contribution to the emerging field of Gen-RecSys.

That said, there are areas where the paper could be strengthened. First, the connection between language-model-specific issues (such as hallucination and the lack of reference data) and recommender system evaluation challenges could be articulated more clearly. While these issues are mentioned, the implications for recommendation tasks and potential mitigation strategies are not fully explored. Second, the discussion of related work is somewhat limited. A broader and deeper engagement with recent literature on LLM-based recommendation systems would better contextualize the proposed framework. Third, although the paper recognizes the importance of human-centric evaluation, it does not sufficiently address how human feedback, preferences, or interaction data could be systematically incorporated into the evaluation process.

Overall, the paper makes a meaningful and positive contribution to the field.
Its proposed framework offers a solid foundation for future work, and with a few refinements, it could become a key reference in the evaluation of generative recommender systems.

1. Liang, P., et al. (2022). Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110.
2. Bang, Y., Madotto, A., & Fung, P. (2022). A Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(9).
3. Wang, X., et al. (2023). Rethinking the Evaluation for Conversational Recommendation in the Era of LLMs. EMNLP 2023.
4. Wu, L., et al. (2023). A Survey on Large Language Models for Recommendation. arXiv preprint arXiv:2305.19860.

Your final vote for this paper: 4 (Lean to Accept)

Metareview

Metareview for paper 2354
Title: Toward Holistic Evaluation of Recommender Systems Powered by Generative Models
Authors: Yashar Deldjoo, Nikhil Mehta, Maheswaran Sathiamoorthy, Shuai Zhang, Pablo Castells and Julian McAuley
Recommendation: accept

Two reviewer recommendations fall on the accept side ("Accept" and "Lean to Accept"), while one falls on the reject side ("Lean to Reject"). After considering this paper and these reviews, the track chairs have decided to accept this paper. All reviewers are generally positive about this work in their comments. The paper articulates a clear perspective on a challenging issue that is informed by both industry and academia. The presented framework is comprehensive, and examples are included to illustrate the perspective. The tables organizing the differences and challenges are excellent and will likely help others understand and study the area. The figure of the holistic framework is also excellent.