This paper presents experiments on automatic keyword extraction from abstracts using a supervised machine learning approach. The main contribution is the integration of linguistic knowledge, such as syntactic features, into the representation of terms, which leads to better results compared to relying solely on statistical features like term frequency and n-grams. Extracting noun phrase (NP) chunks yields higher precision than n-grams, and adding part-of-speech (PoS) tags as features significantly improves results, regardless of the term selection method.
The study compares three term selection approaches: n-grams, NP-chunks, and terms matching predefined PoS tag patterns. Four features are used: term frequency, collection frequency, relative position of the first occurrence, and PoS tags. The results show that the NP-chunk approach achieves higher precision, while the PoS pattern approach provides better recall. The highest F-score is achieved by one of the n-gram runs, and the pattern approach without PoS tags assigns the most terms.
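As a rough illustration of the n-gram selection approach and the three statistical features, the following Python sketch enumerates candidate terms and computes a feature tuple for each. It rests on assumptions: the function names `ngrams` and `term_features`, the maximum n-gram length of 3, and the exact feature definitions are hypothetical and not taken from the paper. Collection frequency is taken here to be the number of documents that contain the term, and the PoS tag feature is left out because it would require a tagger.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """Yield (start_index, term) for every n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, " ".join(tokens[i:i + n])

def term_features(tokens, collection, max_n=3):
    """Map each candidate term of one document to a tuple
    (term frequency, collection frequency, relative first position).

    Assumed definitions (not confirmed by the source):
    - term frequency: raw count of the term in this document
    - collection frequency: number of documents containing the term
    - relative first position: first token index / document length
    """
    counts = Counter()
    first_pos = {}
    for i, term in ngrams(tokens, max_n):
        counts[term] += 1
        # For a fixed term, indices arrive in increasing order,
        # so the first index stored is the first occurrence.
        first_pos.setdefault(term, i)

    # Precompute the candidate-term set of every document in the collection.
    doc_terms = [{t for _, t in ngrams(doc, max_n)} for doc in collection]

    features = {}
    for term, tf in counts.items():
        cf = sum(term in terms for terms in doc_terms)
        features[term] = (tf, cf, first_pos[term] / len(tokens))
    return features

doc = "keyword extraction uses keyword features".split()
other = "machine learning features".split()
features = term_features(doc, [doc, other])
```

In a supervised setting, each tuple would be paired with a label saying whether the term is a manually assigned keyword, and a classifier would be trained on those pairs.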
The experiments use 2000 English abstracts from the Inspec database, divided into training, validation, and test sets. Incorporating PoS tags as features significantly improves performance for every term selection approach. With PoS tags, the NP-chunk approach reaches an F-score of 33.0 and the pattern approach an F-score of 28.1.
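For readers unfamiliar with the metric, the sketch below shows how precision, recall, and the balanced F-score can be computed by comparing extracted terms against manually assigned keywords. The function name `precision_recall_f` and the example terms are hypothetical. How the paper aggregates scores across documents is not stated here, so this version scores a single set of terms.

```python
def precision_recall_f(assigned, gold):
    """Return (precision, recall, F-score) for one set of assigned terms.

    assigned: terms selected by the extractor
    gold:     manually assigned keywords used as the reference
    """
    assigned, gold = set(assigned), set(gold)
    correct = len(assigned & gold)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical example: 4 assigned terms, 3 gold keywords, 2 in common.
p, r, f = precision_recall_f(
    ["noun phrase", "term frequency", "abstract", "tagger"],
    ["noun phrase", "term frequency", "keyword extraction"],
)
```

Because the F-score is a harmonic mean, a run with high recall but low precision (or the reverse) scores lower than a run where the two are balanced.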
The study concludes that integrating linguistic knowledge, such as PoS tags, into the feature set is crucial for effective keyword extraction. Future work includes exploring more sophisticated evaluation methods and generating keywords rather than extracting them. The paper also highlights the need for better categorization of PoS tags and the potential benefits of using thesaurus information for keyword generation.