Reviewer #1 Questions 2. I am an expert on the topic of the paper. Disagree 3. The title and abstract reflect the content of the paper. Strongly agree 4. The paper discusses, cites and compares with all relevant related work Strongly agree 5. Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a") n/a 6. Readability and paper organization: The writing and language are clear and structured in a logical manner. Strongly agree 7. The paper adheres to ISMIR 2024 submission guidelines (uses the ISMIR 2024 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments. Yes 8. Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains. Strongly agree 9. Scholarly/scientific quality: The content is scientifically correct. Strongly agree 10. Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a") n/a 11. Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences. Agree 12. The paper provides all the necessary details or material to reproduce the results described in the paper. 
Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used. Strongly agree 13. Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community. Disagree (Standard topic, task, or application) 14. Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree 15. Please explain your assessment of reusable insights in the paper. They offer an interesting perspective on managing the control/quality tradeoff by separating both objectives. They use fast sampling optimization and noise estimation for control and employ multi-step sampling to focus on quality. This method provides a deep understanding of the power of diffusion models and the level of control achievable at different points of the trajectory. 16. Write ONE line (in your own words) with the main take-home message from the paper. Controlling diffusion trajectories is key for generation quality, control, fast-sampling, and generalizability. 19. 
Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community. Agree 20. Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines Strong accept 21. Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.
Strengths:
- The authors present a comprehensive pipeline for creating high-quality, controlled, and faster-than-real-time music.
- Compared to the DITTO paper, the authors eliminate gradient checkpointing, replacing it with distillation methods. They simplify and speed up their consistency model by removing the GAN loss and unnecessary learned parameters, enabling training in just 32 GPU hours on a single A100. At inference time, they achieve 10 to 20 times faster sampling.
- They demonstrate state-of-the-art text control from an unconditional diffusion model, highlighting the efficiency of their method and its applicability to various downstream tasks.
Weaknesses:
- Based on Section 5.1 and the quantitative results displayed in Table 1, they do not definitively determine whether Consistency models or Consistency Trajectory models are better suited for their objective.
- Their experimental setup is limited in the variety of audio types tested (e.g., no speech or environmental sounds).
Reviewer #2 Questions 2. I am an expert on the topic of the paper. Agree 3. The title and abstract reflect the content of the paper. Strongly agree 4. The paper discusses, cites and compares with all relevant related work Strongly agree 5.
Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a") n/a 6. Readability and paper organization: The writing and language are clear and structured in a logical manner. Strongly agree 7. The paper adheres to ISMIR 2024 submission guidelines (uses the ISMIR 2024 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments. Yes 8. Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains. Strongly agree 9. Scholarly/scientific quality: The content is scientifically correct. Agree 10. Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a") n/a 11. Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences. Agree 12. The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. 
Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used. Disagree 13. Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community. Disagree (Standard topic, task, or application) 14. Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree 15. Please explain your assessment of reusable insights in the paper. The paper proposes a novel methodology to train Consistency Trajectory Models (CTM) that does not rely on adversarial losses and integrates classifier-free guidance in the training process. Although the details on the proposed method are scarce, it can be useful for future (music) generative models to be based on this simplified CTM framework. 16. Write ONE line (in your own words) with the main take-home message from the paper. After distilling a one-step generative model from a teacher diffusion model via consistency distillation, it is possible to speed up Inference Time Optimization (ITO) by an order of magnitude while performing comparably or better when compared to the framework introduced in DITTO. 
19. Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community. Agree 20. Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines Strong accept 21. Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.
The paper proposes an improvement to the Diffusion Inference-Time T-Optimization (DITTO) framework, which enables music generation with a variety of controls by optimizing the initial noise latent with gradient descent with respect to an arbitrary matching function. The proposed improvement allows the optimization to run much faster than previously possible by using a generative model distilled via consistency (trajectory) distillation.
The paper is very well written. The wording is clear and understandable. The provided illustrations are stylistically pleasant and helpful/informative.
From line 207 to 219, the authors describe a list of modifications that they propose for Consistency Trajectory Model (CTM) training that allow the model to be trained in less time and with a seemingly simpler training procedure. This appears to be an impressive, significant and novel contribution to the field of generative modeling, and it would be ideal to include more details about the performance gain/deficiency compared to the original CTM formulation (which includes an adversarial loss and the consistency loss in clean data space).
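For context, the noise-latent optimization summarized at the start of this review can be sketched as below. This is only a toy illustration and not the authors' implementation: the linear map `W` is a hypothetical stand-in for a distilled one-step generator, the squared-error `matching_loss` is an assumed matching function, and the hand-written gradient replaces the automatic differentiation a real model would need. All names are invented for the example.

```python
import random

# Hypothetical stand-in for a distilled one-step generator: a fixed linear
# map from a 2-D "noise latent" to a 3-D "output". A real generator would be
# a neural network differentiated with autodiff.
W = [[1.0, 0.5], [-0.3, 0.8], [0.2, -0.6]]

def generate(latent):
    """One-step 'generation': output = W @ latent."""
    return [sum(w * z for w, z in zip(row, latent)) for row in W]

def matching_loss(output, target):
    """Assumed matching function: half the squared error to the target."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

def loss_gradient(latent, target):
    """Analytic gradient of the loss w.r.t. the latent: W^T (W z - target)."""
    residual = [o - t for o, t in zip(generate(latent), target)]
    return [sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(latent))]

def optimize_latent(latent, target, steps=200, lr=0.1):
    """Plain gradient descent on the initial latent; the generator is fixed."""
    latent = list(latent)
    for _ in range(steps):
        grad = loss_gradient(latent, target)
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent

rng = random.Random(0)
initial_latent = [rng.gauss(0.0, 1.0) for _ in range(2)]  # random starting noise
target = generate([1.0, -2.0])                            # a reachable control target
optimized_latent = optimize_latent(initial_latent, target)

initial_loss = matching_loss(generate(initial_latent), target)
final_loss = matching_loss(generate(optimized_latent), target)
```

Only the latent is updated; the generator weights never change. The cost of each step is one generator evaluation plus its gradient, which is why a one-step distilled generator is attractive for this loop.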
In line 308, the authors write: “For our base DM, we follow a similar setup and model design to DITTO…”, which indicates that a different model is used as a starting point for the two distillation approaches. It could be useful to specify why a new model was trained instead of using the same diffusion model trained for DITTO. Furthermore, it would be informative to know how the base diffusion model used for DITTO-2 performs compared to the diffusion model used in the original DITTO paper, if there is any performance difference at all. In the comparisons, it is not clear whether the base DM from the original DITTO paper or the new base DM (before distillation) that was trained for DITTO-2 is used. More generally, when the authors reference DITTO, it can be difficult to understand whether they mean the DITTO framework or the model presented in the DITTO paper, which uses the DITTO framework. It is recommended to clear up this confusion in a future revision of the paper.
In Table 1 (benchmark results), the authors write: “Best performing configuration for each DITTO-2 setup across five unique tasks”. However, given that the table compares 4 metrics, it is not clear what the authors mean by “best performing configuration”. It is recommended to clarify in a future revision of the paper which metric was chosen to guide the selection of the best (M,T) configuration.
The proposed adaptive schedule for the optimization steps M is intuitive and well presented. However, it could be argued that the best FAD is achieved when M=1 (and thus the runtime is half), and the adaptive schedule is only worth it in terms of a lower MSE. It is recommended to make clear which of the metrics is to be favored in this case and why.
The evaluation presented in Table 3 (text similarity) appears to be flawed, since CLAP is used first to match the embeddings of generated samples to the embeddings of target text prompts, and then again to calculate the CLAP score that evaluates prompt adherence. Using the embeddings of the same model both for “training” (in this case optimizing the latent) and for testing makes the comparison unfair to the baseline MusicGen, which is not trained to maximize CLAP score. A possible solution is to use different CLAP models (ideally trained on different data) to optimize and test.
A minor correction for line 75: “… without text inputs can yields SOTA text control.” yields -> yield
Overall, the paper is scientifically well written and proposes a novel approach to speed up Inference Time Optimization (ITO) by an order of magnitude while performing comparably or better when compared to the original DITTO framework. While it would be preferable to provide a clearer explanation for the details described above (and propose a different evaluation for prompt adherence), it is a very well executed incremental improvement over previous research.
Reviewer #3 Questions 2. I am an expert on the topic of the paper. Agree 3. The title and abstract reflect the content of the paper. Strongly agree 4. The paper discusses, cites and compares with all relevant related work Strongly agree 5. Please justify the previous choice (Required if “Strongly Disagree” or “Disagree” is chosen, otherwise write "n/a") n/a 6. Readability and paper organization: The writing and language are clear and structured in a logical manner. Agree 7. The paper adheres to ISMIR 2024 submission guidelines (uses the ISMIR 2024 template, has at most 6 pages of technical content followed by “n” pages of references or ethical considerations, references are well formatted). If you selected “No”, please explain the issue in your comments. Yes 8.
Relevance of the topic to ISMIR: The topic of the paper is relevant to the ISMIR community. Note that submissions of novel music-related topics, tasks, and applications are highly encouraged. If you think that the paper has merit but does not exactly match the topics of ISMIR, please do not simply reject the paper but instead communicate this to the Program Committee Chairs. Please do not penalize the paper when the proposed method can also be applied to non-music domains if it is shown to be useful in music domains. Agree 9. Scholarly/scientific quality: The content is scientifically correct. Disagree 10. Please justify the previous choice (Required if "Strongly Disagree" or "Disagree" is chosen, otherwise write "n/a")
Soundness:
- line 250 - lacks an ablation on T<8 or a reference to an ablation.
- line 115 - no definition/citation for the “ping pong” sampling.
- line 120 - undefined notation
- line 285 - V and w are undefined
- line 297 - T is undefined
- line 303 - Mgen and Mref aren’t defined
- line 319 - lacks an ablation on the choice of 0.35.
- figure 5 - It is unclear if the difference between CTM and the baseline is significant. I would expect the 95% confidence interval / std of the FAD to be added.
- Real-time results are reported only for the intensity control
- line 358 - It is unclear if the superiority of CTM in terms of quality is significant, given the absence of std / confidence intervals of the measured metric.
- line 392, table 2 - The differences in FAD are insignificant.
- table 3 - The CLAP scores of MusicGen on MusicCaps seem lower than the ones reported in the MusicGen paper/release, which were >= 0.28. Lacks an explanation for this mismatch.
- table 3 - lacks a human evaluation comparing DITTO2’s text adherence performance to MusicGen’s. Listening to the supplementary material reveals that the text-to-music performance of DITTO2, demonstrated for 6 second samples, would likely be ranked significantly lower than MusicGen’s in a human study, in terms of both quality and text-adherence.
- line 437 - The claim for SOTA results should be complemented with a human study in addition to the quantitative evaluation.
Reproducibility:
- No code release is mentioned
Cosmetics:
- line 366 - faster than real-time (inference time?)?
- line 265 - any use -> any used?
- line 381 - needs rephrasing; delete the “the”
11. Novelty of the paper: The paper provides novel methods, applications, findings or results. Please do not narrowly view "novelty" as only new methods or theories. Papers proposing novel musical applications of existing methods from other research fields are considered novel at ISMIR conferences. Strongly disagree 12. The paper provides all the necessary details or material to reproduce the results described in the paper. Keep in mind that ISMIR respects the diversity of academic disciplines, backgrounds, and approaches. Although ISMIR has a tradition of publishing open datasets and open-source projects to enhance the scientific reproducibility, ISMIR accepts submissions using proprietary datasets and implementations that are not sharable. Please do not simply reject the paper when proprietary datasets or implementations are used. Agree 13. Pioneering proposals: This paper proposes a novel topic, task or application. Since this is intended to encourage brave new ideas and challenges, papers rated "Strongly Agree" and "Agree" can be highlighted, but please do not penalize papers rated "Disagree" or "Strongly Disagree". Keep in mind that it is often difficult to provide baseline comparisons for novel topics, tasks, or applications. If you think that the novelty is high but the evaluation is weak, please do not simply reject the paper but carefully assess the value of the paper for the community.
Disagree (Standard topic, task, or application) 14. Reusable insights: The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree 15. Please explain your assessment of reusable insights in the paper.
- The demonstrated ability to convert an unconditional diffusion model into a text-conditional one with inference-time T-optimization opens the door for future research.
- The demonstrated ability to tame an unconditional model for different controls with inference-time T-optimization, without the need for gradient checkpointing, opens the door for future applications.
16. Write ONE line (in your own words) with the main take-home message from the paper. An unconditional diffusion model for music generation could be adapted to stylistic conditional generation, using a variety of controls, including global textual descriptions, using a real-time inference-time T-optimization strategy leveraging consistency distillation for efficiency. 19. Potential to generate discourse: The paper will generate discourse at the ISMIR conference or have a large influence/impact on the future of the ISMIR community. Disagree 20. Overall evaluation: Keep in mind that minor flaws can be corrected, and should not be a reason to reject a paper. Please familiarize yourself with the reviewer guidelines at https://ismir.net/reviewer-guidelines Weak reject 21. Main review and comments for the authors. Please summarize strengths and weaknesses of the paper. It is essential that you justify the reason for the overall evaluation score in detail. Keep in mind that belittling or sarcastic comments are not appropriate.
Strengths:
- The paper shows that by leveraging modified consistency ([20] in the paper) and consistency trajectory ([21] in the paper) distillation techniques, the runtime of DITTO ([17] in the paper) could be reduced significantly, and real-time performance could be achieved for the intensity control.
- The paper proposes the “surrogate optimization” in which a one-step process could be used for T-optimization while a multi-step diffusion process could still be applied to increase the quality of the final result. This is a novel improvement to the inference-time T-optimization techniques of prior work.
- The simplifications for the CTM ([21] in the paper) distillation process, presented in lines 207-211, are novel. Together with the CTM modifications presented in lines 211-219, the authors present a strategy allowing for significant reduction in the CTM distillation compute demand.
Weaknesses:
- The main concern is the scientific quality of the reported results, and specifically, the lack of human studies comparing the results to the evaluated baselines - DITTO, MusicGen.
- The claim for SOTA results isn’t complemented with a human study. Listening to the supplementary material reveals that the text-to-music performance of DITTO2, demonstrated for 6 second samples, would likely be ranked significantly lower than MusicGen’s in a human study, in terms of both quality and text-adherence.
- Real-time inference performance, which is a core part of the paper contribution, is reported only for the intensity control, while for the other controls, the reported inference runtimes are slower than real-time.
- The paper doesn’t mention a planned code release, which would have helped for reproducibility.
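As an illustration of the confidence intervals requested under Soundness, a percentile bootstrap is one possible way to attach an interval to a reported metric. The sketch below is an assumption about how this could be done, not a description of the paper's evaluation: `bootstrap_ci` and the synthetic `scores` are hypothetical, and the statistic here is a simple mean over per-sample scores. For a set-level metric such as FAD, `statistic` would instead have to recompute the metric on each resampled set of embeddings.

```python
import random
import statistics

def bootstrap_ci(samples, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over `samples`.

    Resamples with replacement n_boot times, evaluates the statistic on each
    resample, and returns the alpha/2 and 1 - alpha/2 empirical quantiles.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(n_boot):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower_index = max(0, int((alpha / 2) * n_boot))
    upper_index = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return estimates[lower_index], estimates[upper_index]

# Synthetic per-sample scores, purely for demonstration (not results from the paper).
data_rng = random.Random(1)
scores = [data_rng.gauss(0.3, 0.05) for _ in range(200)]

low, high = bootstrap_ci(scores, statistics.mean)
```

If the intervals of two systems overlap substantially, a claimed difference between them should be read with caution; reporting such intervals alongside Table 2 and Figure 5 would address the significance concerns raised above.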