October 1996; Revised October 1997 | Eric Sven Ristad, Peter N. Yianilos
This report presents a stochastic model for string edit distance, allowing the automatic learning of string edit distance functions from a corpus of examples. The model is applied to the challenging problem of learning word pronunciations in conversational speech, achieving a fourfold reduction in error rate compared to the untrained Levenshtein distance. The approach is applicable to any string classification problem that can be solved using a similarity function against a database of labeled prototypes.
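As a point of reference, the untrained baseline and the prototype-based classification scheme can be sketched as follows. This is a minimal illustration, not the report's implementation: the function names `levenshtein` and `classify`, the `(label, string)` layout of the prototype list, and the toy lexicon are all assumptions made for the example.

```python
def levenshtein(x, y):
    """Unit-cost edit distance between sequences x and y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def classify(observed, prototypes, distance=levenshtein):
    """Return the label of the prototype nearest to `observed`.

    `prototypes` is a list of (label, string) pairs (hypothetical layout).
    """
    best_label, _ = min(prototypes, key=lambda p: distance(observed, p[1]))
    return best_label


# Toy lexicon, invented for illustration only.
lexicon = [("cat", "kaet"), ("dog", "dawg")]
label = classify("kat", lexicon)
```

Any other distance function with the same call signature, including a learned one, could be passed as `distance` without changing the classifier.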
The model is based on a memoryless stochastic transducer that generates edit sequences between strings. The transducer is trained using an expectation maximization (EM) algorithm to estimate the parameters of the model. The model defines two string distances: the Viterbi edit distance, which is based on the most likely edit sequence, and the stochastic edit distance, which aggregates all possible edit sequences. The stochastic edit distance is shown to be more effective in capturing the similarity between strings.
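The difference between the two distances can be illustrated with a small dynamic program. The sketch below assumes that the transducer assigns a fixed probability to each substitution, insertion, and deletion, plus a termination probability, and that each distance is the negative logarithm of the corresponding probability of the string pair. The names `edit_distances`, `delta`, and `stop` and the toy parameter values are assumptions for this example; training the parameters with EM is not shown.

```python
import math


def _neg_log(p):
    """Negative log probability, with zero probability mapped to infinity."""
    return -math.log(p) if p > 0 else math.inf


def edit_distances(x, y, delta, stop):
    """Return (viterbi, stochastic) distances for the string pair (x, y).

    delta maps (a, b) to the probability of one edit operation, where
    a == "" denotes an insertion and b == "" a deletion; stop is the
    termination probability. These names are assumptions of this sketch.
    """
    n, m = len(x), len(y)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]   # most likely edit sequence
    total = [[0.0] * (m + 1) for _ in range(n + 1)]  # sum over all edit sequences
    best[0][0] = total[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            steps = []  # (previous row, previous column, operation probability)
            if i > 0:
                steps.append((i - 1, j, delta.get((x[i - 1], ""), 0.0)))
            if j > 0:
                steps.append((i, j - 1, delta.get(("", y[j - 1]), 0.0)))
            if i > 0 and j > 0:
                steps.append((i - 1, j - 1, delta.get((x[i - 1], y[j - 1]), 0.0)))
            best[i][j] = max(best[pi][pj] * p for pi, pj, p in steps)
            total[i][j] = sum(total[pi][pj] * p for pi, pj, p in steps)
    return _neg_log(best[n][m] * stop), _neg_log(total[n][m] * stop)


# Toy parameters over the alphabet {a, b}; the nine probabilities sum to 1.
delta = {
    ("a", "a"): 0.3, ("b", "b"): 0.3,    # matches
    ("a", "b"): 0.05, ("b", "a"): 0.05,  # mismatched substitutions
    ("a", ""): 0.05, ("b", ""): 0.05,    # deletions
    ("", "a"): 0.05, ("", "b"): 0.05,    # insertions
}
stop = 0.1
viterbi, stochastic = edit_distances("ab", "ab", delta, stop)
```

Because the sum over all edit sequences is never smaller than its largest term, the stochastic distance computed this way is never larger than the Viterbi distance. For long strings the raw probabilities underflow, so a practical version would work in log space.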
The model is evaluated on the Switchboard corpus of conversational speech, where it is applied to the task of learning word pronunciations. Four experiments are conducted, varying the amount of information available in the pronouncing lexicon. The results show that the stochastic model significantly outperforms the Levenshtein distance, with error rates as low as 2.4% in some cases. The model's ability to learn from a single example of a new word's pronunciation suggests its potential for practical applications in speech recognition and other string classification tasks.