VoxCeleb: a large-scale speaker identification dataset

30 May 2018 | Arsha Nagrani†, Joon Son Chung†, Andrew Zisserman
This paper presents VoxCeleb, a large-scale speaker identification dataset collected in real-world conditions. It contains over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube, and is approximately gender balanced, covering a wide range of ethnicities, accents, professions, and ages. The collection pipeline is fully automated and scalable: computer vision techniques identify and verify the speakers, so no human annotation is required.

The paper also introduces a CNN architecture for speaker identification and verification designed to handle variable-length audio inputs, using average pooling over time on variable-length test data. Comparisons with other architectural variants show the importance of variance normalization of the inputs.

The CNN is compared against traditional state-of-the-art baselines such as GMM-UBM and i-vectors/PLDA, and it provides superior performance on both tasks. For identification, it achieves 80.5% top-1 classification accuracy over the 1,251 classes, almost 20% higher than the traditional methods. For verification, it yields a significant improvement over the baselines, with the learned embedding being the crucial step. The paper details the experimental setup for both tasks and concludes that the proposed CNN architecture is effective for speaker identification and verification in real-world conditions.
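The variance normalization the summary refers to can be illustrated as per-frequency-bin mean and variance normalization of the input spectrogram over time. The sketch below is our own minimal NumPy illustration, not the paper's implementation; the function name, array layout, and epsilon are assumptions:

```python
import numpy as np

def mv_normalize(spectrogram: np.ndarray) -> np.ndarray:
    """Normalize each frequency bin to zero mean and unit variance over time.

    spectrogram: array of shape [freq_bins, time_frames].
    The small epsilon guards against division by zero on silent bins.
    """
    mean = spectrogram.mean(axis=1, keepdims=True)
    std = spectrogram.std(axis=1, keepdims=True)
    return (spectrogram - mean) / (std + 1e-8)

# Example: a fake spectrogram with arbitrary offset and scale.
spec = np.random.randn(512, 300) * 3.0 + 5.0
normed = mv_normalize(spec)
```

After normalization, each bin's statistics are standardized, which removes channel-dependent level differences before the features reach the network.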
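The average-pooling idea for variable-length test utterances can be sketched as follows: frame-level features from the network (one vector per time step) are averaged over the time axis, producing a fixed-size utterance representation regardless of utterance length. This is a hedged illustration in NumPy, not the paper's actual layer; shapes and names are assumptions:

```python
import numpy as np

def average_pool(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a [T, D] frame-level feature map to a single [D] vector
    by taking the mean over the time axis T."""
    return frame_features.mean(axis=0)

# Utterances of different lengths yield embeddings of identical size.
short_utt = np.random.randn(300, 1024)    # e.g. a short clip
long_utt = np.random.randn(1500, 1024)    # e.g. a much longer clip
assert average_pool(short_utt).shape == average_pool(long_utt).shape == (1024,)
```

This is what lets a network trained on fixed-length segments accept test utterances of arbitrary duration.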
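For the verification task, the embedding is the crucial step: two utterances are compared by measuring the distance between their embeddings. The snippet below sketches one common scoring scheme, cosine similarity with a decision threshold; this is our illustrative choice, not necessarily the scoring backend used in the paper, and the threshold value is arbitrary:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embeddings, in [-1, 1]."""
    return float(np.dot(l2_normalize(emb_a), l2_normalize(emb_b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray,
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the similarity exceeds an (illustrative) threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

Sweeping the threshold over a set of genuine and impostor trials is what produces the verification error metrics (e.g. equal error rate) reported for such systems.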