Nymble: a High-Performance Learning Name-finder


Daniel M. Bikel, Scott Miller, Richard Schwartz, Ralph Weischedel
This paper presents a statistical, learned approach to finding names and other non-recursive entities in text, using a variant of the standard hidden Markov model (HMM). The system, called Nymble, outperforms other learned name-finders and approaches the performance of the best handcrafted systems, with F-measure scores often exceeding 90. It was developed for the named-entity (NE) task of the Sixth Message Understanding Conference (MUC-6), in which systems mark up organization names, person names, location names, times, dates, percentages, and monetary amounts in text using SGML.

The approach treats name-finding as an information-theoretic problem: the model recovers the original name-class-annotated word sequence that generated the observed text, viewing the raw text as the output of a noisy channel that stripped the annotations. The HMM has eight internal states, one per name class plus a NOT-A-NAME class, and each state generates words with its own statistical bigram language model. Where training data is insufficient to estimate bigram probabilities reliably, the model backs off to unigram and less-informed estimates. Word-features distinguish types of numbers, capitalization patterns, and other surface cues, and are largely language-independent.

Performance is evaluated with F-measure, which combines precision and recall. Nymble achieves high F-measures on both English and Spanish, indicating near-human performance, and even with as little as 100,000 words of training data it performs comparably to handcrafted systems. The implementation is written in C++ on top of a general-purpose class library that supports efficient development and testing.
The paper concludes that the probabilistic approach to name-finding is effective and that the NLP community should continue exploring such methods.
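The F-measure used in the evaluation is the standard weighted combination of precision and recall; with equal weighting it reduces to their harmonic mean. A minimal sketch (the example numbers are illustrative, not the paper's reported scores):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted F-measure combining precision and recall.
    With beta = 1 this is the harmonic mean used in MUC-style scoring."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative only: 92% precision and 89% recall give F1 ~= 0.905
score = f_measure(0.92, 0.89)
```

A balanced F-measure penalizes a system that trades one quantity for the other, so "often exceeding 90" means both precision and recall are high simultaneously.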