27 Jun 2018 | Joon Son Chung, Arsha Nagrani, Andrew Zisserman
This paper introduces VoxCeleb2, a large-scale audio-visual speaker recognition dataset collected from open-source media, containing over a million utterances from more than 6,000 speakers. The dataset addresses the lack of ethnic diversity in existing speaker recognition datasets and targets speaker recognition under noisy and unconstrained conditions. The authors develop and compare Convolutional Neural Network (CNN) models and training strategies for speaker recognition, and report significant improvements over previous work on a benchmark dataset. The proposed system, named VGGVox, uses a deep CNN architecture to map voice spectrograms to a compact Euclidean space in which distances correspond directly to speaker similarity. Models trained on VoxCeleb2 outperform previous methods on the VoxCeleb1 test set, demonstrating the value of both the new dataset and the proposed models.
The paper also introduces new evaluation protocols using the entire VoxCeleb1 dataset, providing a more comprehensive assessment of speaker verification performance.
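
Verifying speakers by distance in an embedding space can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are hypothetical toy vectors standing in for CNN outputs, the function names are invented for illustration, and the equal error rate (EER) is assumed here as the verification metric because it is a common choice, not because the summary names it.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_speaker(emb_a, emb_b, threshold):
    """Accept a pair as the same speaker when its distance is at most `threshold`."""
    return euclidean_distance(emb_a, emb_b) <= threshold


def equal_error_rate(distances, labels):
    """Approximate the EER by sweeping thresholds over the observed distances.

    `labels[i]` is True when pair i comes from the same speaker.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    n_pos = sum(1 for is_same in labels if is_same)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one same-speaker and one different-speaker pair")

    best_gap, eer = float("inf"), None
    for t in sorted(set(distances)):
        # Different-speaker pairs that would be accepted at threshold t.
        false_accepts = sum(1 for d, s in zip(distances, labels) if d <= t and not s)
        # Same-speaker pairs that would be rejected at threshold t.
        false_rejects = sum(1 for d, s in zip(distances, labels) if d > t and s)
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical 2-D embeddings; real embeddings would have far more dimensions.
enrolled = [0.0, 0.0]
probe = [3.0, 4.0]
decision = same_speaker(enrolled, probe, threshold=5.0)
```

In practice the threshold would be tuned on held-out verification pairs rather than fixed by hand, and the distance computation would be vectorised over many pairs at once.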