MLAAD: The Multi-Language Audio Anti-Spoofing Dataset


16 Apr 2024 | Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger
The Multi-Language Audio Anti-Spoofing Dataset (MLAAD) is a large-scale dataset containing 163.9 hours of synthetic speech in 23 languages, generated with 54 state-of-the-art text-to-speech (TTS) models spanning 21 architectures. It was created to address a key limitation of existing anti-spoofing databases, which are heavily biased toward English and Chinese audio and therefore of limited use globally. MLAAD aims to democratize anti-spoofing technology by providing a multilingual resource for training deepfake detection models. In the evaluation, three state-of-the-art deepfake detection models trained on MLAAD generalized better than the same models trained on existing datasets such as InTheWild or FakeOrReal. MLAAD also complements ASVspoof 2019: across eight evaluation datasets, each of the two training sets yielded the best performance on four, suggesting they are most effective when used together.

Each audio file ships with detailed metadata, including its language, duration, and the training data used to build the generating TTS model. A meta.csv file stores, for every synthesized audio file, its language and the transcript it was generated from. The dataset is available for open-source use, and the trained detection models are accessible via an interactive web server.
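As an illustration, the per-file metadata can be inspected with pandas. This is a minimal sketch only: the file path and the column names (path, language, duration, transcript) are assumptions based on the fields described above, so check the actual header of meta.csv before relying on them.

```python
import pandas as pd

# Load MLAAD's per-file metadata. The path and column names are
# assumptions based on the fields described in the summary
# (language, duration, transcript, ...); verify against the real file.
meta = pd.read_csv("MLAAD/meta.csv")

# Hours of synthetic speech per language
# (assuming the duration column is in seconds).
hours_per_language = meta.groupby("language")["duration"].sum() / 3600
print(hours_per_language.sort_values(ascending=False))

# Look up the transcript behind a given clip (hypothetical path).
print(meta.loc[meta["path"] == "fr/some_model/clip_0001.wav", "transcript"])
```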
To build the dataset, audio clips were synthesized in 23 languages using the TTS models, with some prompts translated from English. MLAAD includes both original and synthesized audio; the synthesized files serve as the "fake" class for supervised learning (a hypothetical labeling sketch appears at the end of this summary). The dataset is designed to improve how well deepfake detection models generalize across languages and recording environments.

The quality of the synthesized audio was assessed with the automatic speech recognition model Whisper, which transcribes each audio file back into text. The edit distance between the original prompt and the Whisper transcript then serves as a fidelity measure for the synthesized audio (a minimal sketch of this check follows below). By this measure, the synthesized audio quality was comparable to the original audio in many cases, especially for languages present in the original M-AILABS dataset.

The study highlights the importance of supporting underrepresented languages in anti-spoofing research, since current performance disparities disproportionately affect speakers of those languages. Overall, the results suggest that MLAAD is a valuable resource for improving the detection of audio deepfakes and spoofs across many languages.
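To make the fidelity check concrete, here is a minimal sketch using the open-source openai-whisper package and a normalized Levenshtein edit distance. The paper's exact normalization and the Whisper model size are not specified in this summary, so treat both as assumptions.

```python
import whisper                    # pip install openai-whisper
from Levenshtein import distance  # pip install python-Levenshtein

# Model size is an assumption; the summary does not specify one.
model = whisper.load_model("medium")

def transcription_fidelity(audio_path: str, prompt: str, language: str) -> float:
    """Transcribe a synthetic clip and compare it against the text the
    TTS model was asked to speak. Returns a normalized edit distance,
    where 0.0 means the transcript matches the prompt exactly."""
    result = model.transcribe(audio_path, language=language)
    hyp = result["text"].strip().lower()
    ref = prompt.strip().lower()
    return distance(ref, hyp) / max(len(ref), 1)

# Example (hypothetical file): a low score suggests the synthesized
# speech is intelligible enough for Whisper to recover the prompt.
# score = transcription_fidelity("MLAAD/fr/clip_0001.wav", "Bonjour le monde", "fr")
```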
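Finally, as referenced above, pairing MLAAD's synthetic clips (label 1, "fake") with bona fide recordings such as the M-AILABS originals (label 0, "real") is all that is needed to assemble a supervised training set. The directory layout below is hypothetical; adapt the globs to the actual dataset structure.

```python
from pathlib import Path

def build_labeled_index(mailabs_root: str, mlaad_root: str) -> list[tuple[str, int]]:
    """Build a (path, label) index for supervised deepfake detection:
    0 = bona fide, 1 = synthetic. Roots and layout are placeholders."""
    index = [(str(p), 0) for p in Path(mailabs_root).rglob("*.wav")]
    index += [(str(p), 1) for p in Path(mlaad_root).rglob("*.wav")]
    return index

# pairs = build_labeled_index("data/m-ailabs", "data/mlaad")
```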