---- Comments from the Reviewers ----

Review #0F5E

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Moderate novelty, with clear extensions of existing methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Insufficient contribution for a full-length regular paper, but suitable for a short paper
*Are the references appropriate, without any significant omissions?*: A largely complete list of references with only minor omissions that would not affect the novelty of the submission
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Some minor structural, language, or other issues of exposition that would be easily rectified

*Comments to the Author(s)*
The proposed method is conceptually straightforward and demonstrates meaningful potential for singing-voice-related applications. The authors provide both objective and subjective evaluations, comparing their approach against several baselines and variants. However, the method appears to depend heavily on the Voicebox pipeline, and it is not entirely clear how the proposed system ultimately generates the output audio signal. On the demo page, the synthesized audio contains noticeable artifacts, likely attributable to incorrect phase estimation, yet the manuscript does not discuss this component in sufficient detail. While the primary focus of the paper is F0 generation, so phase-related issues may not bear on the main contribution, the description of the complete generation pipeline remains insufficient.
Additionally, the pitch-generation pipeline seems quite similar to that of Voicebox, raising questions about the novelty of the proposed approach relative to prior work. It would be beneficial for the authors to explicitly clarify these differences in the final manuscript, should the paper be accepted.

Minor comments:
- 1st sentence of Related Work: "Many singing generation frameworks …" → any reference?
- Fig. 1: modify the labels (e.g., x_{ctx} → x_\text{ctx})
- Fig. 1(b): not sure what "/" means or whether it is explained in the manuscript (e.g., x_off / x_ref)
- Fig. 1: not sure whether y_in and y_tgt are mentioned in the manuscript

-----------

Review #037B

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Moderate concerns with the potential for some impact on the contribution or conclusions
*Is the technical contribution novel?*: Moderate novelty, with clear extensions of existing methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Moderate contribution, with the possibility of an impact on the field
*Are the references appropriate, without any significant omissions?*: A largely complete list of references with only minor omissions that would not affect the novelty of the submission
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Some minor structural, language, or other issues of exposition that would be easily rectified

*Comments to the Author(s)*
In this work, the authors identify a major limitation of existing pitch curve tracking methods, specifically for singing voice. The method relies heavily on conditional rectified flow matching, with the piano-roll score as the guiding input.
Overall, the paper appears sound and interesting, but there are multiple areas of unclear writing and contradictions between the equations and the figure. Additionally, it is not quite clear to me how novel the methodology is, due to the highly condensed literature review, with very little discussion of how existing flow-/diffusion-based contour generation systems function. I am rating this work as borderline at the current stage; however, considering that the criticisms can all be addressed fairly quickly, I am very open to adjusting this to a marginal or even strong accept, pending the authors' clarification. Specifically, I would request that the authors clarify the areas of ambiguity indicated below and better situate their contributions in the context of existing work.

- While RFM makes sense as a generation method, the authors need to better justify why RFM is used over other generation methods, beyond its being a "stable, efficient, and high-quality generation process" (p. 1, col. 2). Is this simply following VoiceBox?
- Considering Eq. 4, should the masked portion of $x_t$ in Fig. 1 not simply be zero-mean noise? This is not well defined in the paper, but the choice of $x_0$ here is important. Based on Eq. 4, the masked portion of $x_0$ is simply $\epsilon$, but according to Fig. 1, the masked portion of $x_0$ is actually $x + \epsilon$.
- The inference in Eq. 5 is again unclear. The authors have to be explicit about what $x_t$ and $x_{\text{ctx}}$ are during inference. From Fig. 1, it appears that $x_{\text{ctx}}$ would be [ref f0, zeros], but it is unclear what $x_t$ would be. Would this be [ref f0, random noise], as in training, or also [ref f0, zeros]? I understand that Section 3.3 is supposed to address this, but Section 3.3 is also more confusing than helpful, partly because of inconsistent notation. The authors need to define more precisely what the context $x_{\text{ctx}}$, the initial condition $x_t$, the guiding MIDI $y$, and the alignment $u$ are in each case.
- While I understand the tight page limit, the method section largely describes the method without sufficiently justifying it. I can infer what the authors are trying to do and why, but this should be made more explicit. The point of this setup is precisely that the model is not simply generating F0 from the "robotic" MIDI conditioning, but that it must take into account the pitch variations (i.e., style) inferred from $x_{\text{ctx}}$ to do so.
- It is unclear to me what the unvoiced indicator is. What do 0 and 1 mean in this case in terms of alignment? Why is it not part of the arguments to the velocity field? I currently have no idea where it enters the process.
- Para. 3 of Section 3.2 is more confusing than helpful. It might be better moved to Section 4.2.
- APC is not defined in the first paragraph of Section 4.3; it is only defined a few paragraphs later.
- Is the smoothing algorithm referring to the MIDI smoothing in Section 4.1?
- Para. 2 of Section 4.3: just call the LSTM a discriminator. That would be much less confusing; it took me a while to figure out.
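For concreteness, the masked-infilling rectified-flow construction questioned in the bullets above can be sketched numerically. The following is a minimal illustration under the reading of Eq. 4 suggested there (masked portion of $x_0$ is pure noise $\epsilon$); all variable names, the zero-filled context, and the oracle velocity field are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 256                          # number of F0 frames (illustrative)
x1 = rng.normal(size=T)          # stand-in for a ground-truth F0 curve
mask = np.zeros(T, dtype=bool)
mask[T // 2:] = True             # frames to infill (True = masked)

# Context input: ground truth where observed, zeros where masked.
x_ctx = np.where(mask, 0.0, x1)

# Rectified-flow interpolation x_t = (1 - t) * x0 + t * x1,
# with x0 = eps (pure noise), i.e. NOT x1 + eps.
eps = rng.normal(size=T)
x0 = eps
t = 0.3
x_t = (1.0 - t) * x0 + t * x1

# Flow-matching regression target: dx_t/dt = x1 - x0. A model
# v_theta(x_t, x_ctx, y, t) would be trained to match this on the
# masked frames.
v_target = x1 - x0

# Inference: start from x0 and Euler-integrate t from 0 to 1; here the
# oracle field replaces the learned v_theta, so x lands exactly on x1.
x = x0.copy()
n_steps = 10
for _ in range(n_steps):
    x = x + v_target / n_steps
```

Spelling out which of these quantities the paper's $x_t$ and $x_{\text{ctx}}$ correspond to at inference time would resolve the ambiguity.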
-----------

Review #3E28

*Is the work within the scope of the conference and relevant to ICASSP?*: Clearly within scope
*Is the manuscript technically correct?*: Some minor concerns that should be easily corrected without altering the contribution or conclusions
*Is the technical contribution novel?*: Moderate novelty, with clear extensions of existing methods/concepts
*Is the level of experimental validation sufficient?*: Limited but convincing
*Is the technical contribution significant?*: Moderate contribution, with the possibility of an impact on the field
*Are the references appropriate, without any significant omissions?*: A largely complete list of references with only minor omissions that would not affect the novelty of the submission
*Are there any references that do not appear to be relevant?*: All references are directly relevant to the contribution of the manuscript
*Is the manuscript properly structured and clearly written?*: Some minor structural, language, or other issues of exposition that would be easily rectified

*Comments to the Author(s)*
This paper presents StylePitcher, a general-purpose, style-following pitch curve generator based on rectified flow and conditional infilling. The core idea of decoupling pitch generation from task-specific models and enabling plug-and-play style-preserving F0 generation for APC, SVS, and SVC is novel and well motivated. The architecture is technically sound, and the proposed smoothing strategy, masked inpainting, and conditioning design are clearly explained. The experimental evaluation is convincing: across three tasks, StylePitcher achieves strong improvements in style similarity and competitive audio quality while maintaining reasonable pitch accuracy. Objective metrics, subjective MOS tests, and ablations (smoothing, context masking) support the claims. Visual results in Fig. 2 further illustrate the effective capture of vibrato and expressive pitch nuances. Some minor issues remain.
Pitch-correction accuracy is slightly weaker than Diff-Pitcher's, and the trade-off between style preservation and accuracy could be discussed further. Certain failure cases in SVC (occasional unnatural expressiveness) are acknowledged but not analyzed. Complexity/runtime information is missing; including it would help assess practical deployment. Overall, the paper is well structured and presents a meaningful and broadly applicable contribution to singing processing. It is a strong submission and merits acceptance with minor improvements.

-----------