9 Nov 2020 | Li Wan, Quan Wang, Alan Papir, Ignacio Lopez Moreno
This paper proposes a new loss function called generalized end-to-end (GE2E) loss for speaker verification, which improves model training efficiency compared to the previous tuple-based end-to-end (TE2E) loss. GE2E emphasizes difficult examples during training and does not require an initial example selection stage. The model with GE2E reduces speaker verification EER by over 10% while reducing training time by 60%. Additionally, the paper introduces the MultiReader technique, enabling the model to support multiple keywords and dialects.
Speaker verification (SV) is the task of verifying whether an utterance belongs to a specific speaker, based on that speaker's known utterances. It is divided into text-dependent (TD-SV) and text-independent (TI-SV) categories. TD-SV constrains the phonetic content of the utterances, while TI-SV imposes no such constraint and therefore has to cope with more variability. The paper focuses on TI-SV and on a subtask of TD-SV called global password TD-SV, where verification is based on a detected keyword.
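The verification decision described above can be sketched as follows. This is a minimal illustration, assuming that utterances have already been mapped to fixed-size embedding vectors, that the speaker's known utterances are averaged into a single centroid, and that cosine similarity is compared against a threshold. The function names and the threshold value of 0.7 are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def verify(enroll_embs, test_emb, threshold=0.7):
    """Score a test embedding against one speaker's known utterances.

    enroll_embs: array of shape (num_known_utterances, dim)
    test_emb:    array of shape (dim,)
    threshold:   hypothetical value; a real system would tune it on held-out data
    """
    # Average the unit-length enrollment embeddings into a speaker centroid.
    centroid = l2_normalize(l2_normalize(enroll_embs).mean(axis=0))
    # Cosine similarity between the centroid and the test embedding.
    score = float(np.dot(centroid, l2_normalize(test_emb)))
    return score, score >= threshold

enroll = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
print(verify(enroll, np.array([2.0, 0.1, 0.0])))  # close to the speaker: accepted
print(verify(enroll, np.array([0.0, 0.0, 1.0])))  # orthogonal: rejected
```

Lowering the threshold accepts more impostors and raising it rejects more genuine speakers; the equal error rate (EER) is the operating point at which the two error rates coincide.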
Previous studies used i-vector systems for SV, but recent efforts have focused on neural networks, with end-to-end training being the most successful. In such systems, neural network output vectors are called embedding vectors. GE2E improves upon TE2E by efficiently constructing tuples from input sequences, leading to better performance and faster training.
GE2E builds a similarity matrix that holds the similarity between each embedding vector and every speaker centroid. The loss on this matrix comes in two variants, softmax and contrast. The softmax variant pushes each embedding toward its own speaker's centroid and away from all other centroids, while the contrast variant concentrates on the difficult pairs: each embedding's own centroid and the most similar centroid of another speaker. GE2E is more efficient than TE2E because a single GE2E update is equivalent to multiple TE2E updates.
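The similarity matrix and the two loss variants can be sketched in NumPy as below. This is an illustrative reading rather than the authors' implementation. It assumes a batch of N speakers with M utterances each, a scaled cosine similarity `w * cos + b`, and a leave-one-out centroid when an embedding is compared with its own speaker. In training `w` and `b` would be learned; the fixed values 10 and -5 here are only illustrative defaults, and all function names are hypothetical.

```python
import numpy as np

def _unit(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def ge2e_similarity(emb, w=10.0, b=-5.0):
    """Similarity matrix of shape (N, M, N) for emb of shape (N, M, D).

    Entry [j, i, k] compares utterance i of speaker j with the centroid of speaker k.
    """
    n, m, _ = emb.shape
    emb = _unit(emb)
    centroids = _unit(emb.mean(axis=1))                              # (N, D)
    # Leave-one-out centroid: own-speaker centroid without the utterance itself.
    excl = _unit((emb.sum(axis=1, keepdims=True) - emb) / (m - 1))   # (N, M, D)
    cos = np.einsum("jid,kd->jik", emb, centroids)                   # (N, M, N)
    idx = np.arange(n)
    cos[idx, :, idx] = np.einsum("jid,jid->ji", emb, excl)           # own-speaker entries
    return w * cos + b

def ge2e_softmax_loss(sim):
    """Sum over utterances of -S[j,i,j] + log(sum_k exp(S[j,i,k]))."""
    idx = np.arange(sim.shape[0])
    own = sim[idx, :, idx]                                           # (N, M)
    mx = sim.max(axis=2, keepdims=True)                              # for numerical stability
    lse = mx[..., 0] + np.log(np.exp(sim - mx).sum(axis=2))
    return float((lse - own).sum())

def ge2e_contrast_loss(sim):
    """Sum of 1 - sigmoid(own) + max over other speakers of sigmoid(other). Needs N >= 2."""
    idx = np.arange(sim.shape[0])
    sig = 1.0 / (1.0 + np.exp(-sim))
    own = sig[idx, :, idx]                                           # copy of own-speaker scores
    sig[idx, :, idx] = -np.inf                                       # mask own speaker
    return float((1.0 - own + sig.max(axis=2)).sum())

rng = np.random.default_rng(0)
# Three well-separated speakers, four utterances each, eight dimensions.
emb = 5.0 * np.eye(3, 8)[:, None, :] + 0.1 * rng.normal(size=(3, 4, 8))
sim = ge2e_similarity(emb)
print(sim.shape, ge2e_softmax_loss(sim), ge2e_contrast_loss(sim))
```

Because every utterance in the batch is scored against every centroid at once, one pass over the matrix covers many of the (utterance, speaker) comparisons that a tuple-based loss would visit one tuple at a time.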
The MultiReader technique combines multiple data sources, helping the model perform well on different domains. The paper evaluates GE2E and TE2E on both TD-SV and TI-SV tasks, showing that GE2E outperforms TE2E by over 10% in EER and reduces training time by 60%. The results demonstrate that GE2E is more effective and efficient for speaker verification.
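One way to picture combining data sources is a weighted sum of per-source losses, with one batch drawn from each source at every training step. The sketch below assumes that formulation; the summary above does not spell out the exact objective, so the function name, the weights, and the toy loss are all hypothetical.

```python
def multireader_loss(batches, weights, loss_fn):
    """Weighted sum of one loss value per data source.

    batches: one batch per data source
    weights: one non-negative weight per data source (hypothetical values)
    loss_fn: any function mapping a batch to a scalar loss
    """
    if len(batches) != len(weights):
        raise ValueError("need exactly one weight per data source")
    return sum(a * loss_fn(batch) for a, batch in zip(weights, batches))

def toy_loss(batch):
    """Stand-in loss: the mean of the batch values."""
    return sum(batch) / len(batch)

# A large source weighted 1.0 and a small source weighted 0.5.
print(multireader_loss([[1.0, 3.0], [10.0]], [1.0, 0.5], toy_loss))  # 7.0
```

Weighting the sources separately, rather than pooling them into one dataset, keeps a small source from being drowned out by a much larger one.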