VoxCeleb2: Deep Speaker Recognition


27 Jun 2018 | Joon Son Chung, Arsha Nagrani, Andrew Zisserman
This paper presents VGGVox, a deep speaker recognition system trained on VoxCeleb2, a large-scale audio-visual speaker recognition dataset. The work makes two main contributions: the creation of VoxCeleb2, a dataset containing over a million utterances from over 6,000 speakers, and the development of a deep CNN-based speaker embedding system for recognition under noisy and unconstrained conditions.

VoxCeleb2 is collected from open-source media and covers speakers of 145 nationalities, spanning a wide range of accents, ages, ethnicities, and languages. Because the dataset is audio-visual, it is also useful for applications such as visual speech synthesis, speech separation, and cross-modal transfer between face and voice. It is multilingual and is split into development and test sets.

VGGVox is trained on VoxCeleb2 to learn speaker-discriminative embeddings. The system has three main components: a deep CNN trunk architecture for feature extraction, a pooling layer that aggregates frame-level features into a fixed-length utterance embedding, and a pairwise loss function that directly optimizes the embedding space. Evaluated on the VoxCeleb1 dataset, it significantly outperforms previous work.
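To make the embedding pipeline concrete, here is a minimal sketch of the trunk-plus-pooling idea in PyTorch. This is not the authors' code: the shallow convolutional stack stands in for the deeper CNN trunk used in the paper, and the input shape (512 frequency bins by 300 spectrogram frames) and 512-dimensional embedding size are illustrative assumptions. The key point is temporal average pooling, which collapses a variable-length utterance into a fixed-dimensional speaker embedding.

```python
# Hedged sketch of a CNN trunk + temporal average pooling, NOT the paper's
# exact architecture. Shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Stand-in trunk; the paper uses a much deeper CNN.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        # Collapse the frequency axis to size 1, keep the time axis for pooling.
        self.freq_pool = nn.AdaptiveAvgPool2d((1, None))
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, frames); frames may vary per utterance.
        h = self.trunk(spec)              # (batch, C, F', T')
        h = self.freq_pool(h).squeeze(2)  # (batch, C, T')
        h = h.mean(dim=2)                 # temporal average pooling -> (batch, C)
        return self.fc(h)                 # fixed-dim speaker embedding

emb = SpeakerEmbedder()(torch.randn(2, 1, 512, 300))  # -> shape (2, 512)
```

Because the pooling averages over the time axis, the same network produces a fixed-size embedding for utterances of any duration.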
The paper also introduces new evaluation protocols for speaker verification: a test set drawn from the entire VoxCeleb1 dataset, and a harder test set in which trial pairs share the same nationality and gender. VGGVox achieves state-of-the-art performance on both. The VoxCeleb2 dataset is available for download and provides a large-scale, diverse, and challenging benchmark for speaker recognition research.

For training, the network is first pre-trained with a softmax classification loss and then fine-tuned with a contrastive loss over embedding pairs. Test-time augmentation strategies further improve verification performance on VoxCeleb1.
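The pairwise objective and test-time augmentation can be sketched similarly. In this hedged example, contrastive_loss is a standard contrastive objective over embedding pairs (the margin value is an assumption), and verify scores a trial by averaging cosine similarities between embeddings of random fixed-length crops of the two utterances; the crop count and crop length are illustrative, and model is assumed to be an embedding network like the one sketched above.

```python
# Hedged sketch of a pairwise contrastive loss and crop-averaging test-time
# augmentation. Margin, crop count, and crop length are assumptions, not
# values confirmed by the summary above.
import torch
import torch.nn.functional as F

def contrastive_loss(e1, e2, same_speaker, margin: float = 1.0):
    """e1, e2: embedding batches; same_speaker: tensor of 1.0 (same) / 0.0 (different)."""
    d = F.pairwise_distance(e1, e2)  # Euclidean distance per pair
    # Pull genuine pairs together; push impostor pairs beyond the margin.
    return (same_speaker * d.pow(2)
            + (1 - same_speaker) * F.relu(margin - d).pow(2)).mean()

def verify(model, spec_a, spec_b, n_crops: int = 10, crop: int = 300):
    """Mean cosine similarity over random crops; spec_*: (1, freq_bins, frames),
    assumed longer than `crop` frames."""
    def crops(spec):
        starts = torch.randint(0, spec.shape[-1] - crop + 1, (n_crops,))
        return torch.stack([spec[..., s:s + crop] for s in starts])
    ea = F.normalize(model(crops(spec_a)), dim=1)  # (n_crops, D)
    eb = F.normalize(model(crops(spec_b)), dim=1)
    return (ea @ eb.T).mean().item()  # average over all crop pairs
```

Averaging similarities over many crop pairs, rather than scoring one pair of full utterances, is one simple form the summary's "test time augmentation" could take; the accept/reject decision then comes from thresholding this score.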