Nymble: a High-Performance Learning Name-finder


Daniel M. Bikel, Scott Miller, Richard Schwartz, Ralph Weischedel
This paper presents a statistical, learned approach to finding names and other non-recursive entities in text, using a variant of the standard hidden Markov model (HMM). The system, called Nymble, outperforms other learned name-finders and approaches the performance of the best handcrafted systems, with F-measure scores often exceeding 90. It was developed for the named-entity (NE) task of the Sixth Message Understanding Conference (MUC-6), in which systems mark up organization names, person names, location names, times, dates, percentages, and monetary amounts in text using SGML.

The approach treats name-finding as an information-theoretic problem: the model recovers the original name-class-annotated word sequence that generated the observed text, viewing the raw text as the output of a noisy channel that stripped the annotations. The HMM has eight internal states, one per name class plus a NOT-A-NAME class, and each state generates words with its own statistical bigram language model. Where training data is insufficient to estimate bigram probabilities reliably, the model backs off to unigram and less-informed estimates. Word-features distinguish types of numbers, capitalization patterns, and other surface cues, and are largely language-independent.

Performance is evaluated with F-measure, which combines precision and recall. Nymble achieves high F-measures on both English and Spanish, indicating near-human performance, and even with as little as 100,000 words of training data it performs comparably to handcrafted systems. The implementation is written in C++ on top of a general-purpose class library that supports efficient development and testing.
The paper concludes that the probabilistic approach to name-finding is effective and that the NLP community should continue exploring such methods.
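The F-measure used in the evaluation is the standard weighted combination of precision and recall; with equal weighting it reduces to their harmonic mean. A minimal sketch (the example numbers are illustrative, not the paper's reported scores):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted F-measure combining precision and recall.
    With beta = 1 this is the harmonic mean used in MUC-style scoring."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative only: 92% precision and 89% recall give F1 ~= 0.905
score = f_measure(0.92, 0.89)
```

A balanced F-measure penalizes a system that trades one quantity for the other, so "often exceeding 90" means both precision and recall are high simultaneously.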