View Reviews

Paper ID: 169
Paper Title: Expressive Neural Voice Cloning
Track Name: Conference
Reviewer #1

Questions

  • 1. Summary of the paper (Summarize the main claims/contributions of the paper.)
    • This paper introduces a method that disentangles style and speaker-specific information to allow fine-grained control over expressive style in the speech synthesis task.
  • 2. Clarity (Assess the clarity of the presentation and reproducibility of the results.)
    • Above Average
  • 3. Correctness (Is the paper technically correct?)
    • Paper is technically correct
  • 4. Novelty (Does the paper show sufficiently new findings or just known results?)
    • Above Average
  • 5. Significance (Does the paper contribute a major breakthrough or an incremental advance?)
    • Above Average
  • 6. Overall Rating
    • Accept
  • 7. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.)
    • The paper is well written, with a clear discussion of the existing works that the proposed model builds upon. While most components of the proposed framework are derived from previous works, the combination of latent and heuristically derived style information seems to work well for the expressive voice synthesis task. The sample results are well presented and give a clear illustration of the model's performance across the various tasks.

      One suggestion would be to include illustrations of the core component architectures such as the speaker encoder and Mel-spectrogram synthesizer (for this, the authors could place the contents of section 4.2 under section 3.2).
  • 8. Reviewer confidence
    • Reviewer is knowledgeable
Reviewer #2

Questions

  • 1. Summary of the paper (Summarize the main claims/contributions of the paper.)
    • The paper proposes a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker.
  • 2. Clarity (Assess the clarity of the presentation and reproducibility of the results.)
    • Above Average
  • 3. Correctness (Is the paper technically correct?)
    • Paper is technically correct
  • 4. Novelty (Does the paper show sufficiently new findings or just known results?)
    • Above Average
  • 5. Significance (Does the paper contribute a major breakthrough or an incremental advance?)
    • Above Average
  • 6. Overall Rating
    • Weak Accept
  • 7. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.)
    • The paper investigates the problem of cloning a new speaker's voice with fine-grained control over the style aspects of the generated speech, which has not been explored in the literature. It proposes a conditional generative model that synthesizes speech for a given text, conditioned on speaker and style characteristics derived from a target audio sample. In particular, the speaker encoder module allows embeddings to be derived for speakers not seen during training from a few reference speech samples, and conditioning on the pitch contour and rhythm of the generated speech, together with latent style tokens, allows the model to produce speech with more diverse styles. Extensive experiments are conducted on three benchmark tasks to demonstrate the efficacy of the proposed method. The paper clearly identifies the research gap in the literature and defines the learning goal, which motivates the work well.
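
      For concreteness, the conditioning interface I took away from the paper is roughly the following (my own sketch; the names and shapes are assumptions, not the authors' code):

      ```python
      # Reviewer's sketch of the conditioning inputs as I understand them;
      # names and shapes are my own assumptions, not the authors' code.
      from dataclasses import dataclass
      import numpy as np

      @dataclass
      class StyleConditioning:
          speaker_embedding: np.ndarray  # from the speaker encoder, one vector per (possibly unseen) speaker
          pitch_contour: np.ndarray      # heuristic style input: frame-level F0, shape (T,)
          rhythm: np.ndarray             # heuristic style input: text-to-frame alignment, shape (T, N)
          style_tokens: np.ndarray       # latent style input: GST weights

      # The synthesizer maps (text, StyleConditioning) to a mel-spectrogram,
      # which a separately trained vocoder converts to a waveform.
      ```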

      However, I have some questions and suggestions described as follows:
      1. Why warm start the model using the pre-trained Mellotron?
      2. In Section 4.1: “Past works on voice cloning Wan et al. (2017); Arik et al. (2018) trained their synthesis models on the LibriSpeech dataset Panayotov et al. (2015) and empirically demonstrated the importance of a speaker-diverse training dataset for the task of voice cloning.” It is not clear why this is pointed out, and it raises a new question: do you use a dataset different from the one used in the prior work?
      3. Why is the speaker encoder trained on other datasets rather than on the dataset used to train your model? The same question applies to the vocoder. Could the three modules of the model then be trained together?
      4. Why are there no results for Tacotron2+GST Zeroshot on style transfer?
      5. The caption of Table 1 should make it clear that 1) the three objective error metrics are for the imitation task; 2) larger mean opinion scores are better; and 3) the best results should be marked in bold (the same applies to the other tables).
      6. The training of the vocoder should be described in the experiments section rather than in the method section; the method section could instead include more of the principles for training such a module.
      7. The attention map used to obtain the latent rhythm variable is not described very clearly. What is its concrete mathematical formulation? For instance, is it something like the sketch below?
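      To make point 7 concrete, the kind of formulation I have in mind is the following (my own guess, loosely based on Mellotron-style rhythm conditioning, not something stated in the paper):

      ```latex
      % Reviewer's guess at a possible formulation (not taken from the paper):
      % rhythm as the attention alignment between text tokens and mel frames,
      % obtained by running the decoder in teacher-forced mode on the reference audio.
      \[
        A \in \mathbb{R}^{T \times N}, \qquad
        A_{t,n} = \mathrm{Attention}(q_t, k_n), \qquad
        \sum_{n=1}^{N} A_{t,n} = 1,
      \]
      \[
        d_n = \sum_{t=1}^{T} A_{t,n}
        \quad \text{(approximate duration, in frames, of text token } n\text{)},
      \]
      where $T$ is the number of mel frames, $N$ the number of text tokens, and
      $q_t$, $k_n$ are the decoder query and text-encoder key at step $t$ and token $n$.
      ```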
  • 8. Reviewer confidence
    • Reviewer is knowledgeable
Reviewer #3

Questions

  • 1. Summary of the paper (Summarize the main claims/contributions of the paper.)
    • This work proposes a voice cloning model that can control various style aspects of the synthesized speech for an unseen speaker. This is achieved by learning an auto-encoder model that maps different kinds of style information, i.e., speaker, GST, and text, to feature vectors and then recovers the voice from these feature vectors. The authors evaluate their method on text-to-speech, speech imitation, and speech style transfer tasks.
  • 2. Clarity (Assess the clarity of the presentation and reproducibility of the results.)
    • Above Average
  • 3. Correctness (Is the paper technically correct?)
    • Paper is technically correct
  • 4. Novelty (Does the paper show sufficiently new findings or just known results?)
    • Below Average
  • 5. Significance (Does the paper contribute a major breakthrough or an incremental advance?)
    • Below Average
  • 6. Overall Rating
    • Weak Reject
  • 7. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.)
    • The proposed framework is similar to existing image-to-image translation [1,2] and text-to-speech methods [3,4]. These methods first learn disentangled feature representations and then generate the target images or speech from the appropriate feature vectors. After reading this work, I have several concerns.

      1. My main concern is that the authors did not compare their method with other state-of-the-art methods in the experimental section, which makes the results less convincing. More state-of-the-art baselines need to be added to Tables 1 and 2.

      2. The authors claim that "the pitch contours are derived from the target speech using the Yin algorithm". I wonder how the pitch contours can be derived during inference, since the target speech is then unknown (see the sketch after point 3 for the kind of input I would expect instead).

      3. I wonder whether using an adversarial learning framework could improve the performance of the proposed method.
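
      To make point 2 concrete, this is roughly what I would expect at inference time: the pitch contour would have to come from some reference/style utterance rather than from the (unknown) target speech. A minimal sketch using a standard Yin implementation (my own code; the file name and parameter values are assumptions, not the authors' implementation):

      ```python
      # Reviewer's sketch: extract a frame-level F0 (pitch) contour with the Yin
      # algorithm from a reference/style utterance, since the target speech is not
      # available at inference time. File name and parameter values are assumptions.
      import librosa

      audio, sr = librosa.load("style_reference.wav", sr=22050)

      f0 = librosa.yin(
          audio,
          fmin=librosa.note_to_hz("C2"),  # ~65 Hz lower bound
          fmax=librosa.note_to_hz("C6"),  # ~1047 Hz upper bound
          sr=sr,
          frame_length=1024,
          hop_length=256,
      )
      print(f0.shape)  # one F0 estimate per analysis frame
      ```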

      [1] UNIT: Unsupervised Image-to-Image Translation Networks
      [2] StarGAN v2: Diverse Image Synthesis for Multiple Domains
      [3] Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings
      [4] Using personalized speech synthesis and neural language generator for rapid speaker adaptation
  • 8. Reviewer confidence
    • Reviewer is an expert