1999 | DANIEL M. BIKEL, RICHARD SCHWARTZ, RALPH M. WEISCHEDEL
This paper introduces IdentiFinder™, a hidden Markov model (HMM) designed to recognize and classify names, dates, times, and numerical quantities in text. The model has been evaluated on English (using data from MUC-6 and MUC-7 and broadcast news) and Spanish (using data from MET-1), as well as speech input (using broadcast news). Results show that IdentiFinder performs consistently better than other learning algorithms and is competitive with handcrafted rule-based systems on mixed case text, while outperforming them on text where case information is not available. The paper also presents an experiment demonstrating the effect of training set size on performance, showing that as little as 100,000 words of training data are sufficient for achieving around 90% performance on newswire text. The authors discuss the reasons behind the algorithm's effectiveness and suggest potential areas for further improvement.