---- Comments from the Reviewers ----

Review #44FF

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Technically sound without any identifiable conceptual or mathematical errors, questionable experimental design choices, or weaknesses in experimental validation
*Is the technical contribution novel?*: Substantial novelty, with clearly identifiable new methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Substantial contribution, with a clear potential for impact
*Are the references appropriate, without any significant omissions?*: A largely complete list of references with only minor omissions that would not affect the novelty of the submission
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Well-structured and clearly written with no issues of exposition

*Comments to the Author(s)*

This paper introduces selective TFG and Latent-Control Heads, two complementary ideas that substantially improve inference-time controllability for latent audio diffusion models at extremely low computational cost. The work is clearly written, technically sound, and demonstrates meaningful advances for controllable long-form audio generation. The authors identify a critical bottleneck in existing inference-time audio guidance: the computational burden of backpropagating through high-capacity audio decoders. Their proposed method circumvents this issue by predicting control features directly in latent space, thereby providing orders-of-magnitude speedups without requiring retraining of the base generative model.
The addition of selective TFG is likewise a natural and impactful extension of prior guidance frameworks, allowing guidance to be applied only at sampling steps where it is most effective.

Weaknesses: Hyperparameter tuning is complex, and although the authors provide intuition (Sec. 3.3), ablation studies would further strengthen the methodology section. Some controls (especially pitch) still show room for improvement, but the paper is transparent about these limitations and offers plausible explanations.

-----------

Review #39B8

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Moderate concerns with the potential for some impact on the contribution or conclusions
*Is the technical contribution novel?*: Substantial novelty, with clearly identifiable new methods/concepts
*Is the level of experimental validation sufficient?*: Lacking in some respect
*Is the technical contribution significant?*: Moderate contribution, with the possibility of an impact on the field
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: Some of the references are of limited relevance
*Is the manuscript properly structured and clearly written?*: Serious structural, language, or other issues that impact the comprehensibility of the manuscript

*Comments to the Author(s)*

This paper explores some new approaches to signal generation using diffusion models. This reviewer has a significant problem with it, which is that it packs in far too much new conceptual material for a 4-page conference paper. It is far too dense to be meaningfully assessed unless the reader is either a deep expert in the topic (i.e., probably not many people beyond the author group) or prepared to spend several hours analyzing and interpreting it. This reviewer is not prepared to do that!
A conference paper should deal with a single topic and do it thoroughly. This paper takes 2 (probably) novel contributions and performs a large set of experiments on them that necessitates a half-page table. This table is then discussed in 1 paragraph plus 5 bullet points. The evaluation methods are barely discussed and are presented only as acronyms. There is also an abnormally large number of references, and it seems the authors have bent the font size rules to pack them into 1 page. Just cite fewer papers! There are also some technical issues with incompletely defined notation, but there's so much that isn't presented as a numbered equation that I'm afraid I'm not prepared to list them.

-----------

Review #63A7

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Technically sound without any identifiable conceptual or mathematical errors, questionable experimental design choices, or weaknesses in experimental validation
*Is the technical contribution novel?*: Substantial novelty, with clearly identifiable new methods/concepts
*Is the level of experimental validation sufficient?*: Sufficient validation/theoretical paper
*Is the technical contribution significant?*: Substantial contribution, with a clear potential for impact
*Are the references appropriate, without any significant omissions?*: Complete list of references without any significant omissions
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Well-structured and clearly written with no issues of exposition

*Comments to the Author(s)*

This is an excellent paper that addresses one of the most painful bottlenecks in current generative audio research: the prohibitive cost of inference-time guidance.
I really enjoyed reading this because the motivation is grounded in a real practical problem: backpropagating through the VAE decoder at every timestep makes standard guidance unusable for real-time applications or consumer hardware. The proposed solution of Latent-Control Heads (LatCH) is both elegant and highly effective. By training these lightweight heads to predict control features directly from the noisy latents, you effectively bypass the expensive decoder entirely, which yields massive speedups, as shown in your experiments where runtime drops from 150 seconds to around 17 seconds. I also found the introduction of "Selective TFG" to be a very valuable insight for the community, because it shows that applying guidance only in the early steps is sufficient for control while actually preserving better audio quality by avoiding off-manifold drift in the later stages. The comparison against "Readouts" from the vision domain is fair and insightful, demonstrating that your method of using the VAE latents is more robust for audio tasks. The paper is well-written, the ablation studies justify the design choices well, and the open-source nature of the project (with demos) is a big plus. It is rare to see a paper that improves efficiency by an order of magnitude without sacrificing performance, so I believe this is a definite accept that will likely have a significant influence on future work in controllable audio generation.

-----------