Reviews

Review 1

Relevance to SIGIR: 5 (excellent), very relevant
Technical depth: 4 (good)
Novelty: 4 (good)
Presentation quality: 4 (good)
Related work: 3 (fair)
Verifiability: 4 (good)
Potential impact: 4 (good)

Strengths
1. This paper presents the personalized showcases task, which generates both textual and visual explanations for recommendations.
2. The authors collect a large-scale dataset for this new task, design a novel multi-modal framework for it, and obtain competitive experimental results.
3. The paper is clearly structured, and the proposed method solves the target task.

Weaknesses
1. There is a large body of existing work on explainable recommendation, but this paper lacks an introduction to it [1, 3].
2. Although the collected dataset is claimed as one of the paper's contributions, details of the dataset and of the filtering methods are not given.
3. The human evaluation uses too few metrics to provide strong validation of the explainability of the recommendation results. Existing work uses a richer set of human evaluation metrics to validate explainability [1, 2].

Overall recommendation: 1 (weak accept)

Detailed comments to authors
This paper presents a personalized cross-modal contrastive learning framework to generate diverse and visually-aligned explanations for personalized showcases. Experimental results show the effectiveness of the proposed method for explainable recommendation. Some shortcomings:
1. Although this paper presents a new task, several related fields of work still need to be introduced to clarify the value of the proposed task, namely personalized recommendation, multi-modal recommendation, and explainable recommendation [1, 2, 3].
2. Information about the new dataset needs to be described in more detail.
This includes, but is not limited to: the number of restaurants in the dataset, the number of reviews per restaurant, the length of the text and the number of images per review, and the filtering methods used to collect the data.
3. Explainable recommendations should facilitate user understanding and decision making [2]. For the human evaluation, this paper only takes "Expressiveness" and "Visual Alignment" into consideration. More evaluation perspectives could be considered, such as "Easy to understand", "Comprehensive" and "Reasonable" [1, 2].
Overall, this paper contributes by presenting and addressing the personalized showcases task through a multi-modal framework. The motivation is natural, and the paper is well written and structured.

[1] Zhang Y, Chen X. Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval, 2020, 14(1): 1-101.
[2] Wang S, Gan T, Liu Y, et al. Micro-influencer recommendation by multi-perspective account representation learning. IEEE Transactions on Multimedia, 2022.
[3] Xiao W, Zhao H, Pan H, et al. Beyond personalization: Social content recommendation for creator equality and consumer satisfaction. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019: 235-245.

Review 2

Relevance to SIGIR: 4 (good). This paper proposes a multi-modal framework for generating personalized showcases, which is relevant to SIGIR.
Technical depth: 3 (fair)
Novelty: 3 (fair)
Presentation quality: 4 (good)
Related work: 3 (fair)
Verifiability: 4 (good)
Potential impact: 4 (good)

Strengths
1. This paper is well written and easy to follow.
2. The motivation, i.e., generating multi-modal explanations for recommendation, is interesting.
3. The dataset would benefit the community for future research in this field.

Weaknesses
1. The technical novelty of this paper is quite limited.
The proposed framework is a simple combination of a set of existing components, e.g., CLIP, GPT-2, and DPP.
2. Some experimental settings are missing. For example, it is not clear how the participants in the human evaluation were selected.

Overall recommendation: 1 (weak accept)

Detailed comments to authors
To enrich explanations for recommendation, this paper proposes a multi-modal framework that generates personalized showcases by employing contrastive learning. The effectiveness of the proposed method is validated through both automatic and human evaluation. The pros and cons are as listed under Strengths and Weaknesses above.

Review 3

Relevance to SIGIR: 5 (excellent). The paper is within the scope of SIGIR.
Technical depth: 3 (fair). The problems addressed in this paper are interesting but not new.
Novelty: 3 (fair). This paper employs several state-of-the-art techniques to solve its problems.
Presentation quality: 4 (good). The presentation is good, but symbols should be explained the first time they are used.
Related work: 4 (good). The citations are appropriate.
Verifiability: 4 (good). The main claims of the paper have been properly verified.
Potential impact: 3 (fair). The problems addressed in this paper are interesting but not new.

Strengths
S1. This paper proposes a personalized multi-modal framework to generate diverse and visually-aligned explanations via contrastive learning.
S2.
This paper collects a large-scale dataset from Google Local (i.e., Maps) and extracts high-quality samples through pre-processing and filtering.

Weaknesses
W1. The method should be compared with more advanced baselines.
W2. The presentation needs improvement.
W3. The motivation needs a more careful description.

Overall recommendation: 1 (weak accept)

Detailed comments to authors
O1. As the authors state, the main contribution of this paper is providing both textual and visual information to enrich explanations. Some existing work also focuses on the same problem.
O2. In the experiments, it is recommended to compare with more recent baselines.
O3. Some of the experimental results need explanation, such as why using a single modality (text or image) outperforms using multiple modalities.
O4. The presentation needs improvement. Symbols should be explained the first time they are used, e.g., HX and HY.

Metareview

Metareview for paper 3660
Title: Personalized Showcases: Generating Multi-Modal Explanations for Recommendations
Authors: An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang and Julian McAuley

This paper proposes a personalized multi-modal framework for generating diverse and visually-aligned explanations through contrastive learning. The framework is validated through both automatic and human evaluation on a large dataset collected from Google Local. Reviewers appreciate the novelty of the proposed framework, the effective results obtained, and the clear presentation of the paper. Weaknesses include the lack of an introduction to related work, the need for stronger baseline comparisons, missing experimental settings, and room for improvement in the presentation and the description of the motivation. Reviewer 1 highlights the need for more details on the collected dataset and its filtering methods, and for a richer set of human evaluation metrics to validate explainability.
Reviewer 3 suggests explaining why a single modality outperforms multiple modalities and improving the explanation of symbols in the presentation. Despite these weaknesses, the paper makes a valuable contribution to the field of explainable recommendation, and I recommend it for presentation at SIGIR'23. The authors are advised to carefully consider the reviewers' comments and the points raised to enhance the quality of their work.