Paper ID: 59
Paper Title: Fast Text-to-Audio Generation with Adversarial Post-Training
Track Name: WASPAA2025

Reviewer #1 Questions

1. How confident are you in your evaluation of this paper?
1. Less Confident

2. Importance/Relevance
3. Of sufficient interest

4. Novelty/Originality
3. Moderately original

6. Technical Correctness
3. Probably correct

8. Experimental Validation
3. Limited but convincing

10. Clarity of Presentation
2. Difficult to read

11. Justification of Clarity of Presentation Score (required if the score is 1 or 2)
Section two is difficult to follow for a reader who is not familiar with the details of Rectified Flow training and previously proposed inference acceleration strategies. It is questionable whether a space-limited WASPAA paper is the right place to present such a deep topic.

12. Reference to Prior Work
3. References adequate

14. Overall evaluation of this paper
2. Marginal reject

15. Detailed assessment of the paper
The paper is about accelerating inference of flow-based neural text-to-audio models. Achieving low latency and high synthesis quality is a challenge, which the authors show can be tackled with their proposed approach. The authors' base model uses Rectified Flows trained on 6k hours of Freesound recordings, focusing mainly on loops and sound effects. The main contribution is the application of Adversarial Relativistic-Contrastive (ARC) post-training to reduce the number of sampling steps required during inference while keeping the quality of the synthesized audio at a high level. The relativistic adversarial loss formulation has been introduced previously but had not been applied to audio. The authors find that this loss (see eq. 5) alone leads to high diversity but insufficient adherence to the textual prompts, so they add the contrastive loss term (see eq. 8) to counteract this problem. I will focus mainly on the MOS ratings reported by the authors: they show that high-quality synthesis with high diversity is possible at an RTF of around 150x real time on GPU. This opens the door to creative applications, which are sketched at the end of the paper and illustrated with audio examples on the accompanying website. The goal of the paper is clear, but I have to admit that I can only understand the technical contribution at a surface level. I fear that other readers will likewise gain limited benefit from the paper's contents. The authors promise to open-source their work, but it would still be desirable to go deeper and provide a thorough analysis of the inner workings of the proposed method. Therefore, I cannot recommend acceptance at WASPAA. I do not find the audio examples musically engaging, but that is just my personal opinion and is also true for the other systems under test, so it is likely that the training data distribution is simply insufficient.

Reviewer #2 Questions

1. How confident are you in your evaluation of this paper?
2. Confident

2. Importance/Relevance
3. Of sufficient interest

4. Novelty/Originality
3. Moderately original

6. Technical Correctness
3. Probably correct

8. Experimental Validation
3. Limited but convincing

10. Clarity of Presentation
3. Clear enough

12. Reference to Prior Work
3. References adequate

14. Overall evaluation of this paper
3. Marginal accept

15. Detailed assessment of the paper
This paper proposes a method to accelerate flow and diffusion models using GAN-based post-training. The proposed approach introduces a loss function that combines a relativistic adversarial loss and a contrastive loss.
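For concreteness, here is a minimal sketch of what such a combined discriminator objective can look like, using a generic relativistic pairing plus a shuffled-prompt contrastive term. The names (`d` for a prompt-conditioned discriminator, `emb` for a batch of text embeddings) are illustrative assumptions, and the formulation is a standard one rather than the paper's exact eq. 5 and eq. 8.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(d, x_real, x_fake, emb):
    # Relativistic pairing: the real sample should out-score the fake
    # generated for the same prompt embedding, rather than being judged
    # against an absolute real/fake threshold.
    return F.softplus(-(d(x_real, emb) - d(x_fake, emb))).mean()

def relativistic_g_loss(d, x_real, x_fake, emb):
    # Symmetric generator objective: push fakes above their paired reals.
    return F.softplus(-(d(x_fake, emb) - d(x_real, emb))).mean()

def contrastive_d_loss(d, x_real, emb):
    # Same audio, shuffled prompts: the discriminator should score the
    # correct (audio, prompt) pairing above a mismatched one, which is
    # what pushes the generator toward prompt adherence.
    emb_wrong = emb[torch.randperm(emb.size(0), device=emb.device)]
    return F.softplus(-(d(x_real, emb) - d(x_real, emb_wrong))).mean()
```

In a setup like this, a training step would minimize `relativistic_d_loss` plus a weighted `contrastive_d_loss` for the discriminator and `relativistic_g_loss` for the generator.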
It is particularly interesting that both types of losses are applied in the context of audio synthesis. The proposed method is evaluated through comparisons with existing diffusion models and through ablation studies, demonstrating its effectiveness. It would have been even better if comparisons had also been made with non-autoregressive, fast TTS models that are not based on diffusion, such as VITS or FastSpeech2.

Reviewer #3 Questions

1. How confident are you in your evaluation of this paper?
3. Very Confident

2. Importance/Relevance
3. Of sufficient interest

4. Novelty/Originality
3. Moderately original

6. Technical Correctness
3. Probably correct

8. Experimental Validation
4. Sufficient validation / theoretical paper

10. Clarity of Presentation
3. Clear enough

12. Reference to Prior Work
3. References adequate

14. Overall evaluation of this paper
4. Definite accept

15. Detailed assessment of the paper
This paper proposes a novel, high-quality, fast text-to-audio generation system. To improve sound quality, the authors introduce an adversarial relativistic loss, a GAN objective that compares paired prompt-audio data against generated audio, and a contrastive loss that trains the discriminator to distinguish correct from incorrect prompts. To make inference faster, the authors also apply ping-pong sampling (a sketch follows below) and int8 quantization to the proposed system. According to the experimental results, the proposed system improves the diversity of the generated audio while maintaining audio quality. Inference speed is also significantly improved compared with the baseline systems.
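Since ping-pong sampling may be unfamiliar, the following is a minimal sketch of the idea as it is usually described: alternate between denoising to a clean estimate and re-noising to the next, lower noise level. The code assumes a rectified-flow-style convention and a hypothetical model `g(x, t, emb)` trained to predict the clean sample; none of these names come from the paper.

```python
import torch

@torch.no_grad()
def ping_pong_sample(g, emb, shape, ts=(1.0, 0.75, 0.5, 0.25), device="cpu"):
    # Few-step sampler assuming the convention x_t = (1 - t) * x0 + t * noise,
    # with g(x, t, emb) trained to predict the clean sample x0.
    x = torch.randn(shape, device=device)        # start from pure noise (t = 1)
    for i, t in enumerate(ts):
        t_batch = torch.full((shape[0],), t, device=device)
        x0_hat = g(x, t_batch, emb)              # denoise to a clean estimate
        if i + 1 < len(ts):
            noise = torch.randn_like(x)          # re-noise with fresh noise
            x = (1.0 - ts[i + 1]) * x0_hat + ts[i + 1] * noise
        else:
            x = x0_hat                           # last step: keep the clean estimate
    return x
```

Re-injecting fresh noise at every step is what distinguishes this from a plain deterministic ODE solver and is commonly credited with helping few-step models retain sample diversity.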
Reviewer #4 Questions

1. How confident are you in your evaluation of this paper?
2. Confident

2. Importance/Relevance
4. Of broad interest

4. Novelty/Originality
3. Moderately original

6. Technical Correctness
4. Definitely correct

8. Experimental Validation
3. Limited but convincing

10. Clarity of Presentation
3. Clear enough

12. Reference to Prior Work
3. References adequate

14. Overall evaluation of this paper
4. Definite accept

15. Detailed assessment of the paper

Summary
This paper introduces Adversarial Relativistic-Contrastive (ARC) post-training, a novel framework to accelerate text-to-audio generation in Gaussian flow models (e.g., diffusion/rectified flows) without relying on distillation or Classifier-Free Guidance (CFG). ARC combines a relativistic adversarial loss to improve sample realism with a contrastive discriminator loss to enhance prompt adherence. By applying ARC to an optimized version of Stable Audio Open (SAO), the authors achieve unprecedented inference speeds: ~75 ms for 12 seconds of 44.1kHz stereo audio on an H100 GPU and ~7 seconds on mobile edge devices, while maintaining competitive quality and significantly improving generative diversity compared to distillation-based methods like Presto. The model avoids the high memory costs of distillation and demonstrates practical edge-device deployment via dynamic quantization.

Advantages
1. Innovative Non-Distillation Acceleration. ARC represents the first fully adversarial text-to-audio acceleration method, circumventing the need for the teacher models or trajectory-output pairs required by distillation. This reduces training complexity and memory overhead, making it more scalable for real-world applications.
2. Effective Combination of Relativistic and Contrastive Losses. The relativistic adversarial loss improves sample realism by comparing generated samples to real ones within the same prompt context, while the contrastive loss explicitly enforces alignment between audio and text prompts. This dual objective addresses the common trade-off between diversity and prompt adherence, as shown by a higher CLAP Conditional Diversity Score (CCDS) compared to the baselines.
3. State-of-the-Art Speed and Edge Deployment. The model achieves a 100× speedup over the original SAO while maintaining acceptable quality metrics (e.g., FD_openl3, KL_passt). Dynamic int8 quantization (see the sketch after this review) further optimizes it for edge devices, enabling local inference on smartphones with minimal RAM usage (3.6GB).
4. Diversity Preservation. Unlike distillation methods such as Presto, which sacrifice diversity for quality, ARC explicitly promotes generative variety, as evidenced by both the quantitative CCDS and qualitative listening tests. This is critical for creative applications requiring diverse outputs from the same prompt.

Weaknesses
1. Quality Trade-offs. While ARC accelerates inference significantly, its audio quality (MOS: 3.5 ± 0.4) is slightly lower than that of the pre-trained RF model (3.7 ± 0.3) and SAO (4.0 ± 0.3). The FD_openl3 score (84.43) also lags behind SAO's 78.24, indicating a potential compromise of audio fidelity for speed.
2. Limited Audio Complexity Scope. The study focuses on sound effects and loops, excluding long-form music (e.g., the FMA dataset). Extending ARC to complex audio structures (e.g., multi-instrumental music) remains unaddressed, which may limit its applicability in music production domains.
3. CFG Avoidance Limitations. Although avoiding CFG reduces memory usage, it may limit users' ability to fine-tune generation precision. The paper could explore hybrid approaches that balance CFG's benefits with ARC's efficiency in controlled scenarios.
4. No Comparison with Distillation or CFG. The model could also be accelerated with these two methods. The authors should compare their own method against them, analyzing the strengths and weaknesses of each.

Conclusion
This work makes a significant contribution to real-time text-to-audio generation by introducing an efficient, non-distillation adversarial framework. ARC's focus on speed, diversity, and edge compatibility addresses critical gaps in current models, particularly for creative and mobile applications. While the quality trade-offs and limited audio scope warrant attention, the paper's innovative approach and empirical results establish a new benchmark for accelerated audio generation.

Recommendation: Accept
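As a closing note on the edge-deployment point in Reviewer #4's Advantage 3: dynamic int8 quantization of the kind mentioned there can be sketched in a few lines with PyTorch's built-in API. The toy model below is a stand-in for whatever transformer backbone the paper actually uses; it does not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in for a generator backbone (hypothetical, not the paper's model).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
model.eval()

# Dynamic quantization: weights of the listed module types are stored as
# int8 and dequantized on the fly, while activations stay in float, so
# no calibration data or retraining is required.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Because only the weights are quantized and activations remain in float, this style of quantization needs no calibration pass, which is part of what makes it attractive for on-device inference.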