---- Comments from the Reviewers ----

Review #5AE6

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Moderate novelty, with clear extensions of existing methods/concepts
*Is the level of experimental validation sufficient?*: Lacking in some respect
*Is the technical contribution significant?*: Moderate contribution, with the possibility of an impact on the field
*Are the references appropriate, without any significant omissions?*: A largely complete list of references with only minor omissions that would not affect the novelty of the submission
*Are there any references that do not appear to be relevant?*: Some of the references are of limited relevance
*Is the manuscript properly structured and clearly written?*: Moderate issues of exposition that may require some time to correct, but do not substantially affect the ability to evaluate the technical content

*Comments to the Author(s)*
The paper is interesting and the language is generally fluid, although some typos are present and should be properly addressed. Examples include the use of unnecessary capital letters (e.g., "designing FlashFoley: Other conditioning methods") or missing capitalization where needed. I would recommend reading through the paper a couple more times to correct these issues.

However, although the paper is interesting and most of the text is well structured, the review finds the novelty of the contribution to be limited, as it appears to merge methods previously proposed in the literature rather than introducing a truly original approach. The review also finds the evaluation of the model to be very limited. A listening test with only 10 participants, without a clear explanation of where and how they were recruited, is not sufficient for acceptance, in my opinion. This information needs to be clearly reported to avoid potential biases in the evaluation procedure.

Moreover, extensive research has shown that FAD scores strongly depend on the embeddings used. Therefore, the review would expect multiple FAD results to be reported. If this is not feasible, the authors should provide a proper explanation for why only a single FAD metric is included. Additionally, the improvements reported for the other metrics are marginal in some cases.

Although the review considers this a relevant contribution to the conference and the community, it also believes that, given the current state of generative AI research, it is not possible to accept papers where the evaluation is insufficient, unclear, or lacking transparency, especially when the paper claims otherwise. Furthermore, the authors state that the model is open, yet the code is not available, and only five audio samples have been shared on the supplementary webpage.

I would encourage the authors to substantially extend the work and consider submitting it to another venue or a journal, where there is sufficient space to clearly explain the methodology, provide detailed descriptions of the procedures, and conduct a proper and comprehensive evaluation of the study (if space constraints were the reason these aspects were omitted).
-----------

Review #5721

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Substantial novelty, with clearly identifiable new methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Substantial contribution, with a clear potential for impact
*Are the references appropriate, without any significant omissions?*: A largely complete list of references with only minor omissions that would not affect the novelty of the submission
*Are there any references that do not appear to be relevant?*: Some of the references are of limited relevance
*Is the manuscript properly structured and clearly written?*: Well-structured and clearly written with no issues of exposition

*Comments to the Author(s)*
This work addresses the acceleration of sketch-to-audio generation, where human-produced vocal sketches or other sketch inputs are used to generate audio. The proposed method achieves approximately a 10× speed-up compared to previous approaches, while maintaining competitive generation quality. The paper is well written; however, several concerns arise regarding the evaluation of the generated audio.

[1] In the subjective listening test, the paper states that participants were "given a caption and a vocal sketch." Was the generated audio also presented to the listeners? As written, it appears that only the caption and vocal sketch were shown. Please clarify the full set of stimuli provided during evaluation.

[2] In the subjective evaluation, participants were asked to assess the generated audio while being shown the caption and vocal sketch used as inputs. Does this evaluation measure how well the generated audio matches the caption and vocal sketch, or is it intended to assess perceptual audio quality? The intended evaluation perspective should be made explicit. Additionally, in cases where the generated audio aligns well with the caption but not with the vocal sketch (or vice versa), how are participants expected to assign scores? I think matching the caption and matching the vocal sketch constitute distinct evaluation axes and should be evaluated separately. Moreover, beyond alignment with inputs, a subjective evaluation of the intrinsic perceptual audio quality would also be needed.

[3] How does the model behave when the caption and vocal sketch are inconsistent, for example, when the caption describes a quasi-stationary sound (e.g., "ventilation fan") while the vocal sketch is temporally sparse or highly dynamic? In such scenarios, which input dominates the generation process? Clarification on the model's behavior under contradictory conditions would strengthen the paper.
-----------

Review #7A24

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Moderate novelty, with clear extensions of existing methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Moderate contribution, with the possibility of an impact on the field
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Moderate issues of exposition that may require some time to correct, but do not substantially affect the ability to evaluate the technical content

*Comments to the Author(s)*
This paper describes an extension of Sketch2Sound that adds further controls and fast inference. The text is very hard to follow, which makes it difficult to understand the contribution without reference to the Sketch2Sound paper. The authors should make an effort to explain what the advantages are with respect to this system, and thus what their contribution is. The general methodology should be explained more clearly, describing how the system and the controls work from the point of view of the user. The parts of Figure 1 should be explained independently, and a more detailed architecture diagram would be helpful. Since the results are generally worse than the baseline, please explain clearly what is gained. Also, the authors state that the model is open source, but there is no link or mention of how it would be available.
-----------