============================================================================
          Interspeech 2019 Reviews for Submission #1353
============================================================================

Title: Universal Adversarial Perturbations for Speech Recognition Systems

Authors: Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov,
         Julian McAuley and Farinaz Koushanfar

============================================================================
                              REVIEWER #1
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------
-- Detailed Review --

The authors demonstrate the existence of universal adversarial audio
perturbations that cause mis-transcription of audio signals by automatic
speech recognition systems.

-- Key Strength of the paper --

The proposed approach and experimental results are very interesting. As the
authors state in the conclusions, I think the proposed approach can provide
insights for building more robust neural networks.

-- Main Weakness of the paper --

---------------------------------------------------------------------------

Explanation - Quality of the paper
---------------------------------------------------------------------------
-- Novelty/Originality --

The basic idea of this study is borrowed from an issue previously explored
in the image processing domain. I think this study is not especially
original, but it is very interesting.

-- Technical Correctness --

I think this paper is technically correct.

-- Clarity of Presentation --

I think this paper is written very clearly. Some references to equations
appear as (??) and need to be fixed.

-- Quality of References --

The references seem to be generally good.

-- Reproducibility --

I think the research outcome can be reproduced, but input from the authors
may be necessary.
---------------------------------------------------------------------------

============================================================================
                              REVIEWER #2
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------
-- Detailed Review --

This paper is about adding carefully generated additive noise to speech so
that the victim ASR system mis-transcribes it. Experiments are on public
data with public ASR systems. The authors take inspiration from the image
community. The novelty of this paper comes from posing the adversarial
attack on audio (targeting the transcription sequence, not the speaker
identity).

-- Key Strength of the paper --

Good theoretical background, experiments (including a cross-ASR one), and
examples provided.

-- Main Weakness of the paper --

A few typos, and it is not fully clear whether the noise is
utterance-specific and what happens when noise crafted for one utterance is
added to another (see the sketch at the end of this review).

---------------------------------------------------------------------------

Explanation - Quality of the paper
---------------------------------------------------------------------------
-- Novelty/Originality --

It looks novel enough, though there are already some works on this topic.

-- Technical Correctness --

OK. I'm not an expert in this area.

-- Clarity of Presentation --

Could be improved for non-experts in this topic.

-- Quality of References --

Good enough.

-- Reproducibility --

Looks good, as it is on open data.
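For concreteness, a minimal sketch of the utterance-agnostic additive
attack under discussion: one fixed perturbation is cropped or zero-padded
to each utterance and added. The perturbation `v`, the alignment scheme,
and the `transcribe` function are hypothetical placeholders, not the
paper's code.

    import numpy as np

    def apply_universal(x, v, clip=1.0):
        """Add one fixed perturbation v to any utterance x.

        v is cropped (or zero-padded) to len(x); how the paper
        aligns the two is an assumption here.
        """
        if len(v) >= len(x):
            v = v[:len(x)]
        else:
            v = np.pad(v, (0, len(x) - len(v)))
        return np.clip(x + v, -clip, clip)  # keep samples in valid range

    # Hypothetical usage: the same v should degrade every utterance,
    # which is what distinguishes a *universal* perturbation from
    # per-utterance noise.
    # for x, ref in dataset:
    #     hyp = transcribe(apply_universal(x, v))  # transcribe() assumed
    #     print(ref, "->", hyp)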
---------------------------------------------------------------------------

============================================================================
                              REVIEWER #3
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

In this paper the authors investigate the possibility of perturbing target
audio with a universal audio signal (i.e., the same across all target
signals) so as to produce an adversarial transcription (i.e., the original
transcript should not be recoverable). The proposed investigation is
informative and interesting; however, certain aspects of the experimental
design might need more attention. These are described below:

1. The distortion metric (3.2) measures average loudness. However, short,
loud, bursty noise can also have low average loudness while severely
affecting the listener's experience precisely because of its bursty nature
(see the sketch at the end of this review). Hence a distortion metric that
more closely reflects human listeners' distortion scores might be more
suitable. In the current set of samples provided, the learnt universal
signal does not appear bursty, but the cost function does not explicitly
penalize burstiness.

2. The authors use character error rate as the metric for the
non-decipherability of the adversarial transcript. However, strong ASR
systems, even end-to-end ones, typically rely on longer target units
(e.g., words/sub-words) or on language models, which can generate
linguistically plausible transcripts even for noisy, non-decipherable
audio. The effect of these attacks on models with "more linguistic
strength" is not clear.

3. Modern ASR systems are typically very robust to distortions like white
noise. Noise sources with more complex distributions (e.g., traffic or car
noise) or non-additive noise types (e.g., convolutive distortions like
reverberation, or more complex non-linear distortions like radio channel
distortions) can produce audio that is perfectly understandable by humans
but still affects ASR systems adversely. Comparing the universal
perturbation vector with such realistic noises would help in understanding
the effectiveness of this attack.

4. Cross-model transferability: WaveNet and DeepSpeech could be considered
the same class of models. It would be interesting to study the
model-agnostic nature of this universal perturbation vector across models
that vary more widely.

Other comments:
----------------------
There are several cross-referencing errors in the paper, i.e., many ?? in
place of cross-references.

In Section 2 the authors seem to conflate end-to-end models and DNN models.
There are widely used models, like hybrid HMM-DNN models, that are not
necessarily considered end-to-end.

---------------------------------------------------------------------------

Explanation - Quality of the paper
---------------------------------------------------------------------------
-- Novelty/Originality --

-- Technical Correctness --

-- Clarity of Presentation --

There are minor typos in the paper.

-- Quality of References --

If a paper is available both in a conference/journal and on arXiv, please
cite the conference/journal version. This helps readers gauge the review
process the paper has gone through.

-- Reproducibility --

---------------------------------------------------------------------------
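As an illustration of point 1 above, a small sketch (hypothetical code,
with a simple per-sample average-dB measure standing in for the paper's
exact Section 3.2 metric): a sparse loud burst can score a *lower* average
loudness than steady low-level noise, yet has a far higher peak, which is
what a perceptually motivated metric would need to penalize.

    import numpy as np

    def avg_db(x, eps=1e-12):
        # Average per-sample loudness in dB; a simple stand-in for
        # the paper's distortion metric (its exact definition is
        # assumed, not copied from the paper).
        return float(np.mean(20 * np.log10(np.abs(x) + eps)))

    rng = np.random.default_rng(0)
    sr = 16000  # one second at 16 kHz

    steady = 0.001 * rng.standard_normal(sr)       # constant low-level hiss
    bursty = 1e-6 * rng.standard_normal(sr)        # near-silence...
    bursty[:160] = 0.5 * rng.standard_normal(160)  # ...plus one 10 ms burst

    print(avg_db(steady), np.max(np.abs(steady)))  # modest average, tiny peak
    print(avg_db(bursty), np.max(np.abs(bursty)))  # lower average, loud peak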