---- Comment from the Review Committee ----
The paper received positive reviews from all reviewers. This is a nice contribution on a timely topic. The rebuttal clarified a few minor points of confusion the reviewers had raised. The authors are requested to revise the paper following the reviewers' comments as much as possible.
-------

---- Comments from the Reviewers ----

Review #4C2E

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Substantial novelty, with clearly identifiable new methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Substantial contribution, with a clear potential for impact
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Well-structured and clearly written with no issues of exposition

*Comments to the Author(s)*
This is a strong dataset paper that identifies a real gap in conversational recommendation research and proposes a benchmark that is both well-motivated and carefully constructed. The pipeline for filtering Reddit conversations, validating audio grounding, and extracting structured queries is reasonable and clearly justified. The multimodal evaluation design is a major strength, and the experiments in Table 1 and Fig. 4 highlight meaningful differences between model types and genres. The finding that many models perform best in single-modality settings is convincing and aligns with known multimodal integration issues.
The dataset examples on page 2 illustrate good coverage across genres. The paper’s main limitation is that some technical details around LLM prompting and sampling policies could be described with more precision, but this does not detract from the contribution. Overall, this benchmark should be useful to the community and is well-positioned for acceptance.

-----------

Review #6B82

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Moderate novelty, with clear extensions of existing methods/concepts
*Is the level of experimental validation sufficient?*: Sufficient validation/theoretical paper
*Is the technical contribution significant?*: Moderate contribution, with the possibility of an impact on the field
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Well-structured and clearly written with no issues of exposition

*Comments to the Author(s)*
Strengths:
1. The paper is generally well-written, clearly structured, and easy to follow.
2. MusiCRS represents the first audio-centric conversational music recommendation benchmark, with grounded multimodal annotations and diverse genres. This is a meaningful resource that can benefit future research.
3. Comprehensive experimental evaluation across multiple paradigms (generative / retrieval / traditional), modalities, and genres.

Weaknesses:
1. The paper primarily introduces a dataset and evaluation protocol; it does not propose new modeling techniques or algorithms.
As a result, the contribution is more empirical than technical.
2. The dataset, while valuable, is still relatively small (477 Reddit discussions) and limited to Reddit conversations, which may constrain generalizability.

-----------

Review #0781

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Technically sound without any identifiable conceptual or mathematical errors, questionable experimental design choices, or weaknesses in experimental validation
*Is the technical contribution novel?*: Substantial novelty, with clearly identifiable new methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Substantial contribution, with a clear potential for impact
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Some minor structural, language, or other issues of exposition that would be easily rectified

*Comments to the Author(s)*
This paper presents an important contribution: MusiCRS, a benchmark for audio-centric conversational recommendation. The public release of MusiCRS will likely promote further research in the multimodal domain. MusiCRS contains 477 high-quality dialogues spanning multiple genres (Classical, Hip-Hop, Electronic, Metal, Pop, Indie, Jazz) and 3,589 unique music entities, with audio grounded via YouTube links. The dataset supports evaluation under three input-modality settings: audio-only, query-only, and audio+query (multimodal). However, there are several points in the experimental analysis that could be improved or clarified: 1.
In RQ1, the paper states: “However, only 3 of 9 evaluated models achieve their best performance in multimodal settings.” According to Table 1 (Overall), this appears to be 2 of 9 instead. Additionally, Table 1 notes that “Bold values indicate the best-performing modality configuration for each model”, but some values that should be bolded are not.

2. In RQ2, the paper states: “Retrieval-based approaches demonstrate superior performance compared to generative models across all metrics. CLAP achieves the highest overall performance (22.71% Recall@20), followed by CoLLAP (20.85%), while the best-performing generative model (Qwen2.5-Omni) reaches 21.93%.” However, the retrieval-based CoLLAP (20.85%) does not surpass the generative model Qwen2.5-Omni (21.93%).

3. Also in RQ2: “Within generative models, performance varies considerably (17.42% to 21.93%), whereas retrieval models maintain more consistent performance across input modality configurations.” Even considering only Recall@20 on Overall, the range is 15.77% to 21.93%. Moreover, the generative models’ wider range may be partly due to the fact that 7 generative models were evaluated versus only 2 retrieval models, making the retrieval range naturally narrower.

4. In RQ3: “Metal demonstrates strong performance with retrieval models (CLAP: 26.42% audio-only) but exhibits weaker results with generative approaches.” The retrieval model CoLLAP actually shows relatively weak performance. Statements such as “Classical genres present persistent challenges across all approaches” and “Electronic genres show favorable performance with generative models” also appear inconsistent with the figure.
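For readers checking the numeric comparisons above, a minimal sketch of Recall@K as conventionally defined in retrieval evaluation may be useful (this is an assumed standard definition; the paper may aggregate across queries differently, e.g. micro- vs. macro-averaging):

```python
def recall_at_k(ranked_items, relevant_items, k=20):
    """Fraction of the relevant items that appear in the top-k of the ranking.

    ranked_items: list of candidate IDs, ordered best-first by the model.
    relevant_items: set of ground-truth relevant IDs for this query.
    """
    if not relevant_items:
        return 0.0
    top_k = set(ranked_items[:k])
    hits = sum(1 for item in relevant_items if item in top_k)
    return hits / len(relevant_items)

# Hypothetical example: 2 relevant tracks, only one retrieved in the top 20.
ranking = ["track_a", "track_b", "track_c"] + [f"track_{i}" for i in range(17)]
print(recall_at_k(ranking, {"track_b", "track_x"}, k=20))  # 0.5
```

Reported percentages such as 22.71% would then be this quantity averaged over all evaluation queries and scaled by 100.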
-----------

Review #4A3A

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Moderate concerns with the potential for some impact on the contribution or conclusions
*Is the technical contribution novel?*: Limited novelty, not clearly differentiated from existing methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Substantial contribution, with a clear potential for impact
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Some minor structural, language, or other issues of exposition that would be easily rectified

*Comments to the Author(s)*
1. The article primarily focuses on processing Reddit data. It should also be clarified whether the YouTube audio clips underwent processing, such as noise reduction, whether the selected clips cover the core sections of the songs, and what impact these processing steps may have on the model evaluation results.
2. As noted in the article, a fundamental limitation of current multimodal models has been identified: these models often achieve optimal performance in unimodal scenarios. While this conclusion is derived from experiments, it raises a critical question: could the observed limitation stem from the dataset being poorly suited to existing methods? It would therefore be preferable to develop an approach tailored specifically to the characteristics of this dataset.
3. The experimental results show that both classical music and hip-hop music pose challenges to all models. The specific reasons for this should be analyzed.
-----------