Understanding Proactive Detection of Voice Cloning with Localized Watermarking

AudioSeal is an audio watermarking technique designed for localized detection of AI-generated speech. It uses a generator/detector architecture trained with a localization loss, which enables watermark detection at the level of individual samples, and with a novel perceptual loss inspired by auditory masking, which improves imperceptibility. Evaluated across a range of audio editing operations, AudioSeal achieves state-of-the-art robustness and imperceptibility and outperforms existing methods such as WavMark. Its detector runs in a single pass and outputs a detection logit for each input sample, making it up to two orders of magnitude faster than existing models and suitable for real-time detection. This sample-level output lets AudioSeal identify AI-generated segments within longer audio clips with high accuracy, even after the audio has been edited. AudioSeal also supports multi-bit watermarking, so audio can be attributed to a specific model or version with high accuracy. Its robustness against adversarial attacks depends on keeping the detector's weights confidential. Together, these properties make AudioSeal a practical watermarking solution for voice synthesis APIs, enabling content provenance and detection of AI-generated content at scale.

The research aims to improve transparency and traceability in AI-generated content. Watermarking can nonetheless be misused, for example for government surveillance or for corporate identification of whistleblowers. Despite these risks, ensuring that AI-generated content is detectable remains important, alongside robust security measures and legal frameworks to govern the technology’s use.
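The sample-level detection and multi-bit attribution described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the AudioSeal library. It assumes the detector's per-sample logits have already been converted to probabilities and that a multi-bit message has already been decoded. The function names `localize_segments` and `attribute_version`, the threshold, the minimum segment length, and the nearest-message rule based on Hamming distance are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def localize_segments(
    probs: Sequence[float], threshold: float = 0.5, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) sample ranges flagged as watermarked.

    `probs` holds one (assumed) watermark probability per audio sample.
    Runs shorter than `min_len` samples are discarded as likely noise.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current flagged run began, if any
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(probs) - start >= min_len:
        segments.append((start, len(probs)))
    return segments


def attribute_version(
    decoded_bits: Sequence[int], known_messages: Dict[str, Sequence[int]]
) -> Tuple[str, int]:
    """Pick the known message closest to the decoded bits (assumed rule).

    Returns the matching name and its Hamming distance to `decoded_bits`.
    """
    def hamming(a: Sequence[int], b: Sequence[int]) -> int:
        return sum(x != y for x, y in zip(a, b))

    best = min(
        known_messages,
        key=lambda name: hamming(decoded_bits, known_messages[name]),
    )
    return best, hamming(decoded_bits, known_messages[best])


if __name__ == "__main__":
    # Toy per-sample probabilities: one long flagged run and one isolated sample.
    probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.2]
    print(localize_segments(probs, threshold=0.5, min_len=2))

    # Hypothetical 4-bit messages assigned to two model versions.
    versions = {"model-a": [1, 0, 1, 0], "model-b": [0, 1, 0, 0]}
    print(attribute_version([1, 0, 1, 1], versions))
```

In this toy run, the isolated high-probability sample is dropped by the minimum-length filter, and the decoded bits are assigned to the version whose message differs in the fewest positions. A real deployment would choose the threshold and minimum length from a target false-positive rate rather than the arbitrary values used here.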

Proactive Detection of Voice Cloning with Localized Watermarking

2024 | Robin San Roman, Pierre Fernandez, Hady Elsahar, Alexandre Défossez, Teddy Furon, Tuan Tran