============================================================================
Interspeech 2020 Reviews for Submission #3039
============================================================================

Title: Speech Recognition and Multi-Speaker Diarization of Long Conversations

Authors: Huanru Henry Mao, Shuyang Li, Julian McAuley and Garrison Cottrell

============================================================================
                                REVIEWER #1
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------
-- Detailed Review --

The paper presents a new transcribed speech corpus based on the This American Life radio program. The corpus contains 637 hours of professionally transcribed audio with speaker labels. The paper proposes to use this corpus for studying joint speaker diarization and speech recognition, and provides some baseline results for those tasks.

The corpus, if released, would be very valuable data for speech recognition research. Currently, the most widely used free corpus for ASR research is LibriSpeech. However, LibriSpeech contains dictated speech, which is not so interesting from a practical perspective. I can see the presented corpus becoming a new standard benchmark in both ASR and SD research.

As the paper notes (and also based on my short listening tests), the transcripts are not verbatim: they sometimes omit word repetitions and false starts, and correct some grammatical mistakes. It would be interesting to see what the WER of these transcripts is, compared to true verbatim transcripts.

I think that in addition to ASR and SD, the data could also be used for investigating automatic punctuation insertion and speech activity detection methods.

The paper doesn't mention what license will be used to distribute the corpus. Can it be used for developing commercial systems?
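The transcript-vs-verbatim comparison suggested above would use the standard word error rate. As a point of reference for what is being measured (this is a generic sketch of the usual Levenshtein-based WER, not the authors' evaluation code):

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance between a
    reference and a hypothesis, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Scoring an edited transcript against a verbatim reference, the omitted repetitions count as deletions: wer("I I went to the the store", "I went to the store") gives 2/7.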
It would be interesting to see the performance of "classical" SD (based on BIC) and ASR (hybrid HMM-DNN) methods on this dataset. I am fairly confident that they would outperform the proposed end-to-end models on this dataset.

Some small remarks:

* "We only consider utterances between 3 to 30 seconds." Why?
* "All generated outputs that do not terminate with the [US] token are treated as 100% WER." Please explain.
* "A reconciliation step is then required to assign the SD model's time-position speaker labels to the ASR model's output Y to produce word-level speaker labels S." Typically, SD is done before ASR, so that the ASR output is already "speaker-labeled".
* "Instead, we simply use the model's predicted speaker from the training set as our label for the unseen speaker." Please explain in more detail.
* VAD based on WebRTC: how accurate is it?

-- Key Strength of the paper --

The paper presents a new corpus that could become a standard benchmark dataset for many tasks, such as ASR, speaker diarization, speech activity detection and punctuation insertion.

-- Main Weakness of the paper --

The paper claims to establish baseline results on the proposed dataset but completely ignores the classical speaker diarization methods and hybrid HMM-DNN ASR models that could perform very well on this dataset.

---------------------------------------------------------------------------
Explanation - Quality of the paper
---------------------------------------------------------------------------
-- Clarity of Presentation --

The paper is well written and easy to read and understand.

-- Quality of References --

The TED-LIUM corpus should also be mentioned in Table 1.

-- Reproducibility --

The paper promises to release the model code along with the data upon publication, which would make the results straightforward to reproduce.
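The reconciliation step quoted in the remarks above (assigning the SD model's time-position speaker labels to the ASR output) is typically done by maximum temporal overlap. A minimal sketch, assuming word-level timestamps are available from the ASR decoder (the data layout here is illustrative, not the paper's):

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_speakers(words, segments):
    """words: list of (word, start, end) from the ASR output.
    segments: list of (speaker, start, end) from the diarizer.
    Returns (word, speaker) pairs: each word takes the label of the
    diarization segment it overlaps most in time."""
    out = []
    for w, ws, we in words:
        best = max(segments, key=lambda s: overlap((ws, we), (s[1], s[2])))
        out.append((w, best[0]))
    return out
```

For example, with segments [("A", 0.0, 2.0), ("B", 2.0, 5.0)], a word spanning 0.1-0.5 s is labeled "A" and one spanning 2.2-2.6 s is labeled "B".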
---------------------------------------------------------------------------

============================================================================
                                REVIEWER #2
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------
-- Detailed Review --

-- Key Strength of the paper --

The demonstration of a straightforward joint ASR and SD model, combined with a relevant data set, is appealing.

-- Main Weakness of the paper --

More details on the sequence-to-sequence model parameters, relative to other approaches, would be informative when reading the paper (recognizing that the authors plan to release code). Knowing how to calibrate the ASR models relative to other published work would further strengthen the result.

---------------------------------------------------------------------------
Explanation - Quality of the paper
---------------------------------------------------------------------------
-- Novelty/Originality --

A natural fusion of the two communities (ASR and SD) on a new task; a straightforward incremental result (but encouraging nonetheless).

-- Technical Correctness --

No concerns.

-- Clarity of Presentation --

No concerns.

-- Quality of References --

The main detractor from the paper was the lack of connection between the ASR system and specific existing approaches. The general connection is there, but a specific reference point of comparison would help to calibrate the reader's interpretation of the results for the new data set.

-- Reproducibility --

No concerns.
---------------------------------------------------------------------------

============================================================================
                                REVIEWER #3
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------
-- Detailed Review --

The paper investigates creating rich transcriptions on a new corpus that will be released if the paper is accepted. The goal is to produce "who said what when" for long recordings with multiple parties. To assess performance, it extends an error measure combining speaker and word errors from two-party conversations to multi-party conversations. It compares a conventional two-stage approach of ASR and SD with a model that performs joint optimization, and finds that if good (manual?) segmentation is provided, the joint model does not perform better than the two-stage approach; however, if no segmentation is provided, the joint model outperforms the two-stage approach. One approach uses speaker labels from the training set for diarization, which is compared to an approach that estimates speaker embeddings and clusters them. The latter yields better results. Several techniques to augment the data set to make it more similar to the long-form recognition challenge with changing speakers are investigated, and they improve performance, in particular on the Unaligned recognition task.

-- Key Strength of the paper --

Comparison of joint ASR and SD on a challenging task. The corpus has significant potential for research on conversational systems.

-- Main Weakness of the paper --

Given that the new corpus has similarities to the broadcast news corpora, it would have been useful to establish baseline performance for ASR and SD using models from that domain. It is unclear if or how an external language model was integrated in the system. Although the corpus contains about 7M.
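The embedding-clustering approach this review summarizes (estimate a speaker embedding per utterance, then group utterances by embedding similarity) can be illustrated with a greedy cosine-similarity scheme. This is a toy sketch under assumed inputs, not the clustering method the paper actually uses, and the 0.5 threshold is an arbitrary illustrative value:

```python
import numpy as np

def cluster_embeddings(embs, threshold=0.5):
    """Greedy cosine-similarity clustering of speaker embeddings.
    embs: (n, d) array, one embedding per utterance.
    Returns one cluster id per utterance; a new cluster is opened
    whenever no existing centroid is similar enough."""
    centroids, labels = [], []
    for e in embs:
        e = e / np.linalg.norm(e)  # unit-normalize so dot product = cosine
        if centroids:
            sims = [float(e @ (c / np.linalg.norm(c))) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels.append(best)
                centroids[best] = centroids[best] + e  # running sum as centroid
                continue
        centroids.append(e.copy())
        labels.append(len(centroids) - 1)
    return labels
```

Two nearly parallel embeddings land in the same cluster, while a near-orthogonal one opens a new cluster, which is the behavior being compared against fixed training-set speaker labels.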
---------------------------------------------------------------------------
Explanation - Quality of the paper
---------------------------------------------------------------------------
-- Novelty/Originality --

There are actually several publicly available corpora that have similar properties to TAL, mainly the broadcast news and radio corpora of usually high audio quality. This corpus seems to be truly public in the sense that it will be made available for free to everyone. The paper discusses new tasks for rich transcription and investigates them.

-- Technical Correctness --

The results are plausible given the approach used. It would have been useful to run a more complex broadcast news speech recognition system over the data set to see how existing approaches work, especially if the language model is adapted.

When the train, dev, and test sets were defined, what did you do regarding speaker overlap across these sets? It seems difficult to avoid overlap, especially if the number of hosts is small. Although this seems obvious, it should be explicitly mentioned if that is the case.

In most cases, getting the capitalization of a word right for a given context is not that difficult; however, if capitalization after a non-spoken period is expected, this becomes quite a challenge. I suggest you add an analysis of how much of the WER comes from not getting the capitalization right because of a missed period.

The paper does not explicitly explain how it deals with speaker overlap (cross-talk) and backchannels.

-- Clarity of Presentation --

Overall good. However, some important details regarding the generation of the train, dev, and test sets are missing (see above).

-- Quality of References --

Good. However, it would make sense to refer to more previous work related to broadcast news, since there are similarities in the data set. For example: "Multistage speaker diarization of broadcast news", IEEE TASL 2006, etc. Also, in [15], [20] and [35] the year of publication is missing, and in several the location.
I suggest going over the references.

-- Reproducibility --

The systems are described at a high level with limited detail, which makes them more challenging to reproduce. However, it seems plausible that the results can be reproduced if the corpus is released.

---------------------------------------------------------------------------