Abstract

Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. In this work, we perform white-box attacks on the state-of-the-art Lingvo automatic speech recognition (ASR) system using the LibriSpeech test dataset. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining a 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations that remain effective even after realistic simulated environmental distortions are applied. The details of the algorithms can be found in our paper and the implementations can be found here.

Imperceptible Adversarial Examples

To construct imperceptible adversarial examples for automatic speech recognition systems, we use frequency masking, the phenomenon in which a louder signal makes other signals at nearby frequencies imperceptible. We display two sets of audio examples below. Each set contains a clean audio sample, an adversarial example generated by Carlini’s method, and our imperceptible adversarial example. Listen to them carefully and choose which one is the clean audio.

First Set


Clean audio: “The sight of you bartley to see you living and happy and successful can I never make you understand what that means to me”

Carlini’s adversarial example: “Hers happened to be in the same frame too but she evidently didn’t care about that”

Our imperceptible adversarial example: “Hers happened to be in the same frame too but she evidently didn’t care about that”

Second Set


Carlini’s adversarial example: “This was so sweet a lady sir and in some manner i do think she died”

Our imperceptible adversarial example: “This was so sweet a lady sir and in some manner i do think she died”

Clean audio: “And to think we can save all that misery and despair by the payment of a hundred and fifty dollars”
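
The frequency-masking idea used in this section can be illustrated with a small sketch. It is not the psychoacoustic model from our paper: the threshold below is a deliberately crude stand-in (the loudest clean-audio bin in a small frequency neighbourhood, lowered by a fixed offset), and the names `toy_masking_threshold`, `masking_penalty`, `spread_bins` and `offset_db` are hypothetical. It only shows how a perturbation can be penalized where its power rises above what the clean audio would mask.

```python
import numpy as np

def psd_db(frame):
    """Power spectrum of one Hann-windowed frame, in dB."""
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)
    return 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-12)

def toy_masking_threshold(clean_frame, spread_bins=4, offset_db=20.0):
    """Crude stand-in for a masking threshold (assumption, not a real
    psychoacoustic model): a loud component masks nearby frequencies, so
    take the max clean PSD over neighbouring bins and subtract an offset."""
    psd = psd_db(clean_frame)
    padded = np.pad(psd, spread_bins, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, 2 * spread_bins + 1
    )
    return windows.max(axis=1) - offset_db

def masking_penalty(clean_frame, delta_frame):
    """Mean amount (in dB) by which the perturbation exceeds the threshold."""
    excess = psd_db(delta_frame) - toy_masking_threshold(clean_frame)
    return float(np.mean(np.maximum(excess, 0.0)))

rng = np.random.default_rng(0)
clean = rng.standard_normal(512)      # stand-in for one frame of speech
quiet_delta = 1e-3 * clean            # far below the toy threshold
loud_delta = 10.0 * clean             # well above the toy threshold

print(masking_penalty(clean, quiet_delta))
print(masking_penalty(clean, loud_delta))
```

In an attack, a penalty of this kind would be added to the loss that pushes the transcription towards the target, so that the optimizer prefers perturbations hidden under the clean audio.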

Robust Adversarial Examples

Neither Carlini’s adversarial examples nor our imperceptible adversarial examples remain effective when played over the air. To improve the robustness of adversarial examples played over the air, we use the Image Source Method to create room impulse responses from room configurations (e.g., the room dimensions and the locations of the audio source and the target microphone). We then convolve the room impulse responses with the audio to create artificial utterances (speech with reverberation) that mimic playing the audio over the air. Here is an example of a clean audio sample and its corresponding simulated audio with room reverberation.


Clean audio: “The more she is engaged in her proper duties the less leisure will she have for it even as an accomplishment and a recreation”

Simulated clean audio with reverberation: “The more she is engaged in her proper duties the less leisure will she have for it even as an accomplishment and a recreation”
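
The convolution step described above can be sketched as follows. To keep the example self-contained, the room impulse response here is exponentially decaying noise rather than one produced by the Image Source Method; `synthetic_rir`, `reverberate` and the `rt60` parameter are hypothetical names chosen for illustration, and the sine tone stands in for a speech recording.

```python
import numpy as np

def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy room impulse response: decaying noise with a direct path.
    This is an assumption for illustration, NOT the Image Source Method."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that falls by 60 dB after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct sound from source to microphone
    return rir / np.max(np.abs(rir))

def reverberate(audio, rir):
    """Convolve audio with a room impulse response and match the input peak."""
    wet = np.convolve(audio, rir, mode="full")
    peak = np.max(np.abs(wet))
    if peak == 0:
        return wet
    return wet / peak * np.max(np.abs(audio))

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 1 s stand-in for speech
rir = synthetic_rir(sample_rate)
wet = reverberate(audio, rir)
print(audio.shape, rir.shape, wet.shape)
```

The output is longer than the input by the length of the impulse response minus one sample, which corresponds to the reverberation tail.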


To make the generated adversarial examples robust to various environments, we consider a challenging setting in which the exact configuration of the room where the attack will be performed is unknown. Instead, we only know the distribution from which the room configuration will be drawn. First, we generate 1000 random room configurations sampled from the distribution as the training room set. The test room set consists of another 100 random room configurations sampled from the same distribution. The constructed robust adversarial examples achieve an attack success rate of over 60% in the 100 test rooms. Below are two audio samples. One is the clean audio simulated as played over the air in one test room, and the other is our robust adversarial example simulated as played in the same test room. The background noise in the robust adversarial example is clearly audible.


Clean audio with reverberation: “Old dances are simplified of their yearning bleached by time”

Robust adversarial example with reverberation: “You don't seem to realize the position”
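
The train/test room setup described in this section can be sketched as below. The sampling ranges, the 0.5 m wall margin, and the names `sample_room` and `success_rate` are assumptions made for illustration; the actual distribution of room configurations is given in our paper, not here.

```python
import numpy as np

def sample_room(rng):
    """Draw one room configuration (all ranges are hypothetical, in metres)."""
    dims = rng.uniform([3.0, 3.0, 2.5], [10.0, 10.0, 4.0])  # width, depth, height
    margin = 0.5  # keep source and microphone away from the walls
    source = rng.uniform(margin, dims - margin)
    mic = rng.uniform(margin, dims - margin)
    return {"dims": dims, "source": source, "mic": mic}

def success_rate(transcripts, target):
    """Fraction of rooms in which the ASR output equals the target phrase."""
    return sum(t == target for t in transcripts) / len(transcripts)

rng = np.random.default_rng(0)
train_rooms = [sample_room(rng) for _ in range(1000)]  # used to build the attack
test_rooms = [sample_room(rng) for _ in range(100)]    # held out for evaluation

# Made-up transcripts, only to show how the rate is computed.
example_rate = success_rate(["a", "b", "a", "a"], "a")
print(len(train_rooms), len(test_rooms), example_rate)
```

Evaluating only on rooms that were not used during construction is what makes the reported success rate a measure of robustness to unseen environments.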

Imperceptible & Robust Attacks

By combining the two techniques developed above, we can generate adversarial examples that are both imperceptible and robust, achieving an attack success rate of around 50% in 100 simulated test rooms. Here we display four sets. Each set includes three audio samples convolved with the same simulated room reverberation: the clean audio, a robust adversarial example, and an imperceptible and robust adversarial example. Listen to them carefully; you should be able to hear obvious background noise in the robust adversarial example. The perturbation in the imperceptible and robust adversarial example is much less perceptible than in the robust adversarial example, but the example can still be distinguished from the clean audio.

First Set


Clean audio: “It is so made that everywhere we feel the sense of punishment”

Robust adversarial example: “Said missus horton a few minutes after”

Imperceptible and robust adversarial example: “Said missus horton a few minutes after”


Second Set


Robust adversarial example: “If spoken to she would not speak again”

Clean audio: “Come and get the boolooroo she said going toward the benches”

Imperceptible and robust adversarial example: “If spoken to she would not speak again”


Third Set


Imperceptible and robust adversarial example: “I suppose that's the wet season too then”

Clean audio: “Were i in the warm room with all the splendor and magnificence”

Robust adversarial example: “I suppose that's the wet season too then”


Fourth Set


Robust adversarial example: “A terrible thought flashed into my mind”

Clean audio: “He's another who's awfully keen about her let me introduce you”

Imperceptible and robust adversarial example: “A terrible thought flashed into my mind”