This paper presents targeted audio adversarial examples that can be used to manipulate automatic speech recognition (ASR) systems. The authors demonstrate that given any audio waveform, it is possible to generate a nearly inaudible perturbation that causes the ASR system to transcribe the audio as any desired phrase. They apply their white-box iterative optimization-based attack to Mozilla's DeepSpeech end-to-end ASR system and achieve a 100% success rate. The feasibility of this attack introduces a new domain for studying adversarial examples.
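The review describes the attack only at a high level. As a rough illustration of what a white-box iterative optimization attack of this kind can look like, the sketch below perturbs a waveform so that a differentiable CTC-based model transcribes a chosen target phrase while the perturbation stays small. The toy model, alphabet, loss weighting, and L_inf projection are illustrative assumptions, not DeepSpeech or the authors' exact formulation.

```python
# Minimal sketch of a white-box iterative optimization attack on a CTC-based
# ASR model. The model, alphabet, and hyperparameters are hypothetical
# stand-ins, not the authors' implementation.
import torch
import torch.nn as nn

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # class 0 is the CTC blank; chars map to 1..27

class ToyASR(nn.Module):
    """Hypothetical differentiable stand-in for an end-to-end ASR network."""
    def __init__(self, n_chars=len(ALPHABET) + 1):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=320, stride=160)  # crude framing
        self.proj = nn.Linear(64, n_chars)

    def forward(self, wav):                          # wav: (batch, samples)
        h = torch.relu(self.conv(wav.unsqueeze(1)))  # (batch, 64, frames)
        logits = self.proj(h.transpose(1, 2))        # (batch, frames, chars)
        return logits.log_softmax(dim=-1)

def attack(model, x, target_text, steps=1000, lr=1e-3, eps=0.05):
    """Minimize the CTC loss toward the target transcription while keeping
    the perturbation inside an L_inf ball of radius eps."""
    target = torch.tensor([[ALPHABET.index(c) + 1 for c in target_text]])
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    ctc = nn.CTCLoss(blank=0)
    for _ in range(steps):
        log_probs = model(torch.clamp(x + delta, -1.0, 1.0))  # (1, T, C)
        input_lens = torch.tensor([log_probs.shape[1]])
        target_lens = torch.tensor([target.shape[1]])
        loss = ctc(log_probs.transpose(0, 1), target, input_lens, target_lens)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():          # project back into the L_inf ball
            delta.clamp_(-eps, eps)
    return (x + delta).detach()

if __name__ == "__main__":
    model = ToyASR()
    x = torch.randn(1, 16000) * 0.1    # one second of synthetic 16 kHz audio
    adv = attack(model, x, "open the door")
```

The paper reportedly measures and constrains the perturbation by its relative loudness in decibels rather than a fixed L_inf bound; the projection step above is a simplification.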
The paper introduces a new approach to constructing adversarial examples in the audio domain, which is significantly more challenging than in the image domain. The authors show that their method generates adversarial examples with a mean distortion of -31 dB, roughly the difference in loudness between ambient noise and a person talking. They also demonstrate that the method applies to a variety of scenarios, including generating adversarial examples from non-speech audio and hiding speech so that it is not transcribed at all.
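For context, a relative distortion figure like -31 dB can be computed as the peak level of the perturbation in decibels minus the peak level of the original waveform; the short sketch below illustrates that calculation. The exact definition and the function names are assumptions for illustration, not the authors' code.

```python
# Sketch of computing a relative distortion figure in dB, assuming distortion
# is measured as the peak level of the perturbation relative to the peak level
# of the original waveform (an illustrative definition).
import numpy as np

def peak_db(signal):
    """Peak amplitude of a waveform expressed in decibels."""
    return 20.0 * np.log10(np.max(np.abs(signal)) + 1e-12)

def relative_distortion_db(original, adversarial):
    """dB of the perturbation relative to the original audio; more negative
    means a quieter, harder-to-hear perturbation."""
    delta = adversarial - original
    return peak_db(delta) - peak_db(original)

# Example: a perturbation whose peak is ~3% of the original's peak
# comes out around -30 dB of relative distortion.
x = np.random.uniform(-1.0, 1.0, 16000)
adv = x + 0.03 * np.random.uniform(-1.0, 1.0, 16000)
print(relative_distortion_db(x, adv))
```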
The paper also examines the properties of audio adversarial examples, including how robust they are to various forms of noise. The authors find that audio adversarial examples behave differently from those on images, and that linearity does not appear to hold in the audio domain. They also raise several open questions, including whether universal adversarial perturbations exist and whether audio adversarial examples transfer across different ASR systems.
The authors conclude that targeted audio adversarial examples are effective against ASR systems and that their method generates them with a high success rate. They also highlight the importance of further research into audio adversarial examples and the need to develop robust defenses against them.