2001 | Wee Meng Soon, Hwee Tou Ng, Daniel Chung Yong Lim
This paper presents a machine learning approach to coreference resolution of noun phrases in unrestricted text. The approach learns from a small annotated corpus and handles general noun phrases, not just specific types like pronouns. It also allows coreference resolution for any entity type, including organizations and persons. The system is evaluated on the MUC-6 and MUC-7 coreference corpora, achieving accuracy comparable to non-learning approaches. It is the first learning-based system to match the performance of state-of-the-art non-learning systems on these datasets.
Coreference resolution involves identifying whether two expressions refer to the same entity. The paper focuses on the MUC-6 and MUC-7 coreference tasks, which define coreference relations between textual elements like definite noun phrases, demonstrative noun phrases, proper names, and pronouns. The system uses a pipeline of natural language processing modules to identify markables, which are then used to generate feature vectors for training a classifier.
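As a rough illustration of how such a pipeline might feed the classifier, the sketch below defines a minimal markable representation. The class, field, and function names are hypothetical and cover only the properties that the features described next rely on.

```python
# Minimal sketch of a markable representation; names and fields are
# hypothetical and only cover what the pairwise features below need.
from dataclasses import dataclass

@dataclass
class Markable:
    text: str            # surface string of the noun phrase
    sentence_idx: int    # sentence position, used for the distance feature
    is_pronoun: bool     # pronoun status
    semantic_class: str  # e.g. "person", "organization", "date"

def extract_markables(document: str) -> list[Markable]:
    """Stand-in for the NLP pipeline: tokenization, tagging, noun phrase
    identification and named entity recognition would populate this list."""
    markables: list[Markable] = []
    # ... pipeline modules would append Markable objects here ...
    return markables
```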
The system uses 12 features to determine coreference, including the distance between markables, pronoun status, string match, semantic class agreement, and alias detection. Training examples are generated from the annotated coreference chains: each anaphor is paired with its closest preceding antecedent in the chain to form a positive example, and with every markable occurring between the two to form negative examples. A decision tree classifier, C5, is trained on the resulting feature vectors to learn coreference rules.
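A minimal sketch of that pairing scheme, reusing the hypothetical Markable type above; the helper names are invented, chains are assumed to be sorted in document order, and only a few of the twelve features are shown in simplified form.

```python
def make_training_pairs(chains: list[list[Markable]],
                        markables: list[Markable]) -> list[tuple[Markable, Markable, bool]]:
    """Pair each anaphor with its closest preceding antecedent (positive)
    and with every markable between the two (negative)."""
    pairs: list[tuple[Markable, Markable, bool]] = []
    order = {id(m): i for i, m in enumerate(markables)}  # document order
    for chain in chains:  # each chain is sorted in document order
        for antecedent, anaphor in zip(chain, chain[1:]):
            pairs.append((antecedent, anaphor, True))    # positive example
            # markables between antecedent and anaphor become negatives
            for m in markables[order[id(antecedent)] + 1 : order[id(anaphor)]]:
                pairs.append((m, anaphor, False))
    return pairs

def feature_vector(i: Markable, j: Markable) -> dict:
    """A few of the 12 features, simplified (e.g. string match here is a
    plain lowercase comparison rather than the paper's normalized match)."""
    return {
        "DIST": j.sentence_idx - i.sentence_idx,          # sentence distance
        "I_PRONOUN": i.is_pronoun,
        "J_PRONOUN": j.is_pronoun,
        "STR_MATCH": i.text.lower() == j.text.lower(),
        "SEMCLASS": i.semantic_class == j.semantic_class,
    }
```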
The system is tested on the MUC-6 and MUC-7 datasets, achieving recall of 58.6% and 56.1% and precision of 67.3% and 65.5%, respectively, for balanced F-measures of 62.6% (MUC-6) and 60.4% (MUC-7). It outperforms several other systems, including RESOLVE, an earlier learning-based system, and is competitive with the best non-learning systems. Its performance is influenced by the accuracy of the upstream NLP modules, particularly named entity recognition and noun phrase identification.
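The balanced F-measure is the harmonic mean of recall and precision, so the reported figures can be checked directly:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(67.3, 58.6), 1))  # MUC-6: 62.6
print(round(f_measure(65.5, 56.1), 1))  # MUC-7: 60.4
```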
The paper also analyzes the system's errors, including spurious links caused by surface string matches and missing links due to inadequate features or incorrect noun phrase identification. A feature analysis shows that the ALIAS, STR_MATCH, and APPOSITIVE features contribute most to coreference resolution, and learning curves show that the system reaches its peak performance with only about 25 training documents.
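Purely as an illustration of how those three features could dominate a learned rule, here is a hand-written approximation over a feature dictionary like the one sketched earlier; it is not the tree that C5 actually induces.

```python
def classify(fv: dict) -> bool:
    """Hand-written approximation of a coreference rule driven by the
    highest-value features; illustrative only, not the learned C5 tree."""
    if fv.get("STR_MATCH"):
        return True   # matching surface strings are a strong signal
    if fv.get("ALIAS"):
        return True   # e.g. an acronym or abbreviated form of a name
    if fv.get("APPOSITIVE"):
        return True   # appositive constructions, e.g. "the chairman, Mr. Smith"
    return False
```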