============================================================================
Interspeech 2019 Reviews for Submission #3099
============================================================================

Title: Expediting TTS Synthesis with Adversarial Vocoding

Authors: Paarth Neekhara, Chris Donahue, Miller Puckette, Shlomo Dubnov and Julian McAuley

============================================================================
REVIEWER #1
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

-- Detailed Review --

The paper proposes that it is enough to focus on one of either spectral magnitude or phase, and to rely on heuristics for the other, to reach good results in vocoding. It focuses on improving the amplitude spectrum by using a GAN model first proposed in image processing, applied to the spectrogram as the input image, and then uses LWS to obtain a synthesis that improves over the usual poor baseline that Griffin-Lim provides.

Although the authors claim "high-quality" speech, the results and the samples shared seem to indicate that this system's interest lies in the speed-up it provides while preserving "reasonable" quality (apart from being a better baseline).

-- Key Strength of the paper --

- makes several independent and interesting points
- the methodology seems okay and the results are interesting
- detailed experimental protocol, parameters, etc.

-- Main Weakness of the paper --

- You cite SEGAN, but there have been several papers in recent years using GANs to improve the output of the TTS text-to-feature blocks. It has been done with MGCs, mel-spectrograms and linear spectrograms, not only in SEGAN. See work by Kaneko et al. for instance. And the results that you obtain, and the samples I listened to, clearly seem to indicate that you are below WaveGlow and WaveNet (and therefore very likely below Parallel WaveNet).
So the main contribution of this paper seems to be introducing a new baseline system, or a kind of "compromise": lesser quality for higher speed. In any case, given the results of Tables 2, 3 and 4, I would argue that using the term "high-quality" speech in the conclusion is a bit too strong a statement.
- the samples I listened to are very reverberant; could this explain that the recordings score "only" around 4 in all the MOS tests? And what would be the impact of using reverberant audio vs. clean speech (as usually used in TTS) on your results?
- MOS tests are compared in various places and significant conclusions are drawn from these comparisons. I recommend using MUSHRA for such important conclusions. Clarification: I wrote this before reaching the part in Section 5.1 where it is unclear whether all stimuli are presented simultaneously or separately to the judges (see below).
- there are several points to clarify:
  - Section 2: have you evaluated the impact of using full-band mel-spectrograms (up to 11 kHz) rather than having the cut-off at 7.6 kHz?
  - Section 2: "a systematic evaluation ... has never been concluded"; I suspect it has been done, but I'll accept that it may not have been peer-reviewed anywhere. To the best of my knowledge, at one time or another most PhD students working on vocoders and spectral representations go through that step of analysis-synthesis, replacing either the synthetic amplitude or the phase with real values (or vice versa), if only "to 'see' how it sounds".
  - Section 2: IMO the secondary conclusion is actually more important than the first (even more so if the first conclusion relies on comparing absolute MOS scores rather than preference tests or MUSHRAs)
  - Section 4: is the architecture of the GAN model used different from what can be found in other publications? This might be part of the novelty aspect of this paper.
  - Section 4.1: can you discuss and give more information about those "patches" in the Discriminator L1 loss?
dimension (how many bins?), resolution (all bins or sub-sampled?), impact on the convergence, etc.
  - Section 5: how many frames ('n' in 'n x 513') are there per batch?
  - Section 5: what was the stopping criterion after 100k iterations: convergence, lack of resources, overfitting? How were the learning rate and the regularization parameter decided?
  - Section 5.1: "We cannot directly compare to the Parallel WaveNet approach because it is an end-to-end TTS method rather than a vocoder". This is simply incorrect. There is absolutely nothing that prevents Parallel WaveNet from being trained on mel-spectrograms. Parallel WaveNet is a different model (and training process) than WaveNet but, exactly like WaveNet, it can be conditioned on either linguistic and prosodic information [7] or on mel-spectrograms [6] (last I heard, "Tacotron --> mel-spectrograms --> Parallel WaveNet" is what Google acknowledged to use in their current systems).
  - Section 5.1: "we show ... randomly-ordered batch": it is unclear whether you present the 6 randomized versions of the same utterance simultaneously with an upper anchor to the listeners on a single page, or whether you present those samples on separate independent pages. The former would make your MOS scores closer to a MUSHRA and comparable.
  - Table 4: why use SpecGAN for GL but MelSpecGAN with AdVoc?
- various typos, for instance "psuedoinverse" several times; please proof-read (ironically, there are very likely several typos and grammar mistakes in this very review)

---------------------------------------------------------------------------
Explanation - Quality of the paper
---------------------------------------------------------------------------

-- Novelty/Originality --

I'm on the fence concerning the novelty aspect of this paper:
- there have been several papers in recent years using GANs to improve the output of the TTS text-to-feature block (Tacotron or SPSS).
It has been done with MGCs, mel-spectrograms and spectrograms; maybe the GAN model being used here is different from those papers, but this is not made clear by the authors.
- playing with all the combinations of synthetic/real amplitude/phase spectra is something most PhD students in the domain will have done at one point or another, but I'm not sure I can point at a formal systematic analysis of it, and I was interested to see the one here.
- the novelty aspect seems to lie more in the resynthesis using LWS, which gives a significant speed-up to the system while improving over Griffin-Lim, rather than in the GAN itself. But that resynthesis is the last step of the system and has no impact on the training of the GAN, so there is no reason why it wouldn't work with any other existing GAN (or even non-GAN) post-filter approaches. On the other hand, the GAN improving the spectrogram is what makes LWS work better.

-- Technical Correctness --

- MOS tests are compared in various places in a way that could significantly impact the conclusions of the paper, but it is possible that they are actually done presenting all the stimuli to the judges, which would make them closer to a MUSHRA. Clarification asked of the authors.
- some inaccuracies (cf. Parallel WaveNet)

-- Clarity of Presentation --

There are a few points to clarify (cf. review).

-- Quality of References --

Only SEGAN was cited, while several other GAN spectral post-filter approaches are missing. In this case the GAN is an important part of the paper, so I think it's worth growing the list a bit. I mentioned Kaneko et al. because that was the first to come to mind, but obviously the authors should feel free to reference whichever papers make more sense to them.
-- Reproducibility --

Hyper-parameters and code provided, training data available, samples of results available as well.

---------------------------------------------------------------------------

============================================================================
REVIEWER #2
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

-- Detailed Review --

The authors present their work on generating spectrograms rather than waveforms with a neural vocoder. In this paper they were able to justify their approach experimentally with regard to phase estimation, and showed respectable results with their new methodology. I was able to clone their repository and run their examples without much problem.

The only criticism I would make is that they did not try to predict both the magnitude and the phase, as opposed to just one. I also think that the statement that modern DL methodologies would learn the magnitude spectrum better does need justification. Otherwise I find the paper to be of great quality.

-- Key Strength of the paper --

Well written, good experimental methodology, and easy to reproduce.

-- Main Weakness of the paper --

Short discussion.

---------------------------------------------------------------------------
Explanation - Quality of the paper
---------------------------------------------------------------------------

-- Novelty/Originality --

I have not come across this idea before.

-- Technical Correctness --

Experiments and statistics are sound.

-- Clarity of Presentation --

Well written, though at times the paper does seem to make statements without justification.
-- Quality of References --

Seems complete, covering the important works.

-- Reproducibility --

Source code & pretrained models available.

---------------------------------------------------------------------------

============================================================================
REVIEWER #3
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

-- Detailed Review --

This paper suggests an alternative approach to overcome TTS computational bottlenecks when spectral representations are vocoded into waveforms. The authors propose a GAN for magnitude estimation and a recent phase reconstruction method. The proposed solution seems to be faster than the state of the art and the results are quite good.

-- Key Strength of the paper --

The paper is very well written and clear. However, it seems it spends a significant amount of space (and time) to show something that in my humble opinion is well known: that good phase estimation is more important than good magnitude estimation in speech synthesis (see "Advances in phase-aware signal processing in speech communication", SpeCom, Vol. 81, among others). That said, the experimental approach to this question is interesting. The work is evaluated on a well-known corpus and samples are available for listening. The proposed approach is demonstrated to be faster than SOTA without significant quality loss.

-- Main Weakness of the paper --

None.

---------------------------------------------------------------------------
Explanation - Quality of the paper
---------------------------------------------------------------------------

-- Novelty/Originality --

The novelty lies in the fusion of GAN-based magnitude estimation + LWS phase reconstruction.

-- Technical Correctness --

Very solid, contains all the necessary information, statistically justified.
-- Clarity of Presentation --

Very clear, excellent use of English.

-- Quality of References --

Very good.

-- Reproducibility --

Can be reproduced with some effort. The experimental setup is clearly explained and the corpus is public.

---------------------------------------------------------------------------