Understanding A Simple Rule-Based Part of Speech Tagger

This paper presents a simple rule-based part-of-speech (POS) tagger that automatically learns its rules and achieves accuracy comparable to stochastic taggers. The rule-based tagger has several advantages over stochastic taggers, including reduced storage requirements, clearer rules, easier improvements, and better portability across different tag sets, corpus genres, and languages. The main contribution of this work is demonstrating that stochastic methods are not the only viable approach for POS tagging.

The tagger works by initially assigning each word its most likely tag, as estimated from a large tagged corpus. It then improves performance by applying patches, which are rule-based corrections derived from the errors made in the initial tagging. Patches that reduce errors are added one after another, and the process is repeated until performance stops improving.

The tagger was tested on 5% of the Brown Corpus, achieving an error rate of 5.1% with 71 patches. This result is comparable to those reported for stochastic taggers, although an exact comparison is difficult due to differences in domains and tag sets. The rule-based tagger can also automatically learn to tag idioms, such as "as old as," without requiring hand-crafted rules.

The paper concludes that a simple rule-based tagger with few rules can perform as well as stochastic taggers, making rule-based methods a viable alternative for POS tagging. The tagger is highly portable and requires less statistical information than stochastic methods, which makes it more compact and easier to adapt to different languages and corpora.
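The two-stage procedure described above (most-likely-tag initialization followed by error-driven patches) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy corpus, the default tag for unknown words, and the single patch template used here ("change tag A to tag B when the previous tag is Z") are all assumptions made for illustration, and the greedy selection loop is one plausible reading of "repeated until performance stops improving."

```python
from collections import Counter, defaultdict

def train_lexicon(tagged_sents):
    """Map each word to its most frequent tag in the training corpus."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def initial_tags(words, lexicon, default="NN"):
    """Initial tagging: most likely tag per word (default tag is an assumption)."""
    return [lexicon.get(w, default) for w in words]

def apply_patch(tags, patch):
    """Apply one hypothetical patch (from_tag, to_tag, prev_tag):
    change from_tag to to_tag wherever the preceding tag is prev_tag."""
    from_tag, to_tag, prev_tag = patch
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(pred, gold):
    return sum(p != g for p, g in zip(pred, gold))

def learn_patches(pred, gold, max_patches=10):
    """Greedily pick the patch that removes the most errors, then repeat."""
    patches = []
    while len(patches) < max_patches:
        # Candidate patches are proposed from the current errors.
        candidates = {
            (pred[i], gold[i], pred[i - 1])
            for i in range(1, len(pred))
            if pred[i] != gold[i]
        }
        best, best_err = None, count_errors(pred, gold)
        for patch in sorted(candidates):
            err = count_errors(apply_patch(pred, patch), gold)
            if err < best_err:
                best, best_err = patch, err
        if best is None:  # no candidate reduces the error count
            break
        patches.append(best)
        pred = apply_patch(pred, best)
    return patches

# Toy data (hypothetical): "run" is a noun more often than a verb.
train = [
    [("the", "AT"), ("run", "NN")],
    [("a", "AT"), ("run", "NN")],
    [("to", "TO"), ("run", "VB")],
    [("we", "PPSS"), ("want", "VB")],
]
lexicon = train_lexicon(train)

words = ["we", "want", "to", "run"]
gold = ["PPSS", "VB", "TO", "VB"]

start = initial_tags(words, lexicon)
patches = learn_patches(start, gold)

final = start
for patch in patches:
    final = apply_patch(final, patch)
```

On this toy input the initial tagger mislabels "run" as a noun, and the learner proposes a single patch that retags a noun as a verb after "to". A fuller version would use several patch templates (for example, conditions on neighboring words as well as tags) and a separate held-out patch corpus.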

A Simple Rule-Based Part of Speech Tagger

1988 | Eric Brill