KEA: Practical Automatic Keyphrase Extraction

1999 | Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin and Craig G. Nevill-Manning
Kea is an algorithm for automatically extracting keyphrases from text. It works in two stages: training and extraction. During training, it builds a model from documents whose keyphrases are known. During extraction, it applies that model to identify keyphrases in new documents. Kea identifies candidate phrases in three steps: text cleaning, candidate identification, and stemming and case-folding. It calculates two features for each candidate: TF×IDF, which measures a phrase's frequency in a document relative to its rarity in general use, and first occurrence, which is the distance into the document of the phrase's first appearance. The model uses these features to predict whether a candidate is a keyphrase. Kea uses the Naïve Bayes machine learning technique, which is simple and effective.

Kea's performance was evaluated on documents from the New Zealand Digital Library. On average, Kea matches between one and two of the five keyphrases chosen by the author in this collection. Kea works well with a training set of as few as 20 documents and performs better on full text than on titles and abstracts alone. The global document corpus used for TF×IDF can contain as few as 10 documents.

Kea is available from the New Zealand Digital Library project. The authors plan to extend the evaluation, including testing with human expert judges and comparing Kea with other document summarization methods. Kea provides useful metadata where none existed before, offering a valuable tool for digital library designers and users.
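
The two features described above can be illustrated with a short Python sketch. It is not Kea's actual code. It assumes a common formulation: TF×IDF is the phrase's frequency in the document divided by the document length, multiplied by the negative base-2 log of the fraction of corpus documents containing the phrase, and first occurrence is the number of words before the phrase's first appearance divided by the document length. The function names, the regex tokenizer, the use of all n-grams of up to three words as candidates, the add-one smoothing, and the omission of stemming are all simplifying assumptions made for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_positions(tokens, max_len=3):
    """Map each n-gram of up to max_len words to the token indices where it starts."""
    positions = {}
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            positions.setdefault(phrase, []).append(i)
    return positions


def phrase_features(document, corpus, max_len=3):
    """Return {phrase: (tfidf, first_occurrence)} for every candidate in document.

    corpus is a list of other documents standing in for the global corpus.
    """
    tokens = tokenize(document)
    size = len(tokens)
    corpus_phrases = [set(candidate_positions(tokenize(d), max_len)) for d in corpus]
    n_docs = len(corpus)

    features = {}
    for phrase, starts in candidate_positions(tokens, max_len).items():
        # Number of corpus documents that contain the phrase.
        df = sum(phrase in phrases for phrases in corpus_phrases)
        # Assumption: add one to both counts so an unseen phrase never yields log(0).
        idf = -math.log2((df + 1) / (n_docs + 1))
        tfidf = (len(starts) / size) * idf
        # Fraction of the document that precedes the first appearance.
        first_occurrence = starts[0] / size
        features[phrase] = (tfidf, first_occurrence)
    return features


if __name__ == "__main__":
    doc = ("keyphrase extraction helps digital libraries because "
           "keyphrase extraction adds metadata")
    corpus = ["digital libraries store documents", "metadata helps users search"]
    feats = phrase_features(doc, corpus)
    for phrase in ("keyphrase extraction", "digital libraries", "metadata"):
        print(phrase, feats[phrase])
```

In this sketch a phrase that is frequent in the document but rare in the corpus receives a high TF×IDF score, and a phrase that appears early receives a first-occurrence value near zero. A trained classifier, such as the Naïve Bayes model mentioned above, would then take these two numbers as its input.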