This paper presents a statistical method for retrieving and identifying collocations from large textual corpora, implemented in a tool called Xtract. Collocations are recurrent word combinations that co-occur more often than expected by chance and are essential for lexicography and natural language processing. Previous methods for retrieving collocations often produced improper associations due to spurious patterns in the training corpus. Xtract improves upon these methods by using original filtering techniques to produce more accurate and relevant collocations.
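To make "more often than expected by chance" concrete, the sketch below compares the observed count of an adjacent word pair with the count expected if the two words occurred independently. The function name `observed_vs_expected`, the toy token list, and the restriction to adjacent pairs are assumptions made for illustration; this is not presented as Xtract's actual statistic.

```python
from collections import Counter

def observed_vs_expected(tokens, w1, w2):
    """Return (observed, expected) counts of w1 immediately followed by w2.

    The expected count assumes w1 and w2 occur independently:
    P(w1) * P(w2) * number of adjacent-pair positions.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    expected = (unigrams[w1] / n) * (unigrams[w2] / n) * (n - 1)
    return observed, expected

# Toy data (hypothetical, not taken from the paper's corpus).
tokens = ("the stock market rose as the stock market index gained "
          "while the bond market fell").split()
obs, exp = observed_vs_expected(tokens, "stock", "market")
print(f"observed={obs}, expected={exp:.2f}")
```

A pair whose observed count is well above its expected count is a candidate collocation; a real system would also need a significance threshold, which this sketch leaves out.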
Xtract operates in three stages. In the first stage, pairwise lexical relations are identified using statistical methods. These relations are then passed to the second and third stages. In the second stage, multiple-word combinations and complex expressions are identified. In the third stage, parsing and statistical techniques are combined to label and filter collocations, increasing the precision of Xtract from 40% to 80% with a recall of 94%.
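The precision and recall figures can be read through the usual definitions, sketched below. The counts are hypothetical, chosen only so that the arithmetic reproduces the reported percentages; they are not the paper's data. The sketch also assumes that recall is measured against the valid candidates entering the third stage, which the summary above does not state explicitly.

```python
def precision(valid_kept, total_kept):
    """Fraction of retained candidates that are valid collocations."""
    return valid_kept / total_kept

def recall(valid_kept, valid_available):
    """Fraction of the available valid collocations that were retained."""
    return valid_kept / valid_available

# Hypothetical counts, for illustration only.
candidates_in = 1000   # candidates entering the third stage
valid_in = 400         # of which valid collocations
kept = 470             # candidates surviving the filter
valid_kept = 376       # of which valid collocations

precision_before = precision(valid_in, candidates_in)
precision_after = precision(valid_kept, kept)
recall_after = recall(valid_kept, valid_in)
print(precision_before, precision_after, recall_after)
```

Under these assumed counts, the filter discards most invalid candidates while losing only a small share of the valid ones, which is the trade-off the reported figures describe.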
The paper discusses four properties of collocations: they are arbitrary, domain-dependent, recurrent, and cohesive lexical clusters. Collocations vary in the number of words involved, syntactic categories, and how rigidly words are used together. Examples include rigid noun phrases like "The New York Stock Exchange," predicative relations like "make-decision," and phrasal templates like "the Dow Jones industrial average."
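One way to picture how rigidly two words are used together is to tabulate the relative positions at which the second word appears around the first. The sketch below is an illustration under stated assumptions: the function `offset_histogram`, the window of five words, and the toy sentences are all hypothetical and are not claimed to match Xtract's implementation.

```python
from collections import Counter

def offset_histogram(sentences, w1, w2, window=5):
    """Count how often w2 occurs at each relative offset from w1.

    Only positions within +/- `window` tokens of w1 are considered.
    """
    hist = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != w1:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] == w2:
                    hist[j - i] += 1
    return hist

# Toy sentences (hypothetical).
sentences = [
    "the stock exchange opened late".split(),
    "traders left the stock exchange early".split(),
    "the board will make a decision today".split(),
    "they make the final decision tomorrow".split(),
    "analysts make decision after decision".split(),
]
rigid = offset_histogram(sentences, "stock", "exchange")
flexible = offset_histogram(sentences, "make", "decision")
print(dict(rigid), dict(flexible))
```

In this toy data, "stock" and "exchange" always appear at the same offset, as in a rigid noun phrase, while "make" and "decision" are spread over several offsets, as in a predicative relation.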
Xtract retrieves three types of collocations: rigid noun phrases, predicative relations, and phrasal templates. It has been tested on a 10 million-word corpus of stock market news reports and has shown high precision in retrieving collocations. The paper also discusses related work in lexicography and computational linguistics, highlighting the importance of collocations in language generation, translation, and other applications. Xtract's three-stage process allows for the retrieval of a wide range of collocations with high performance, making it a valuable tool for lexicographic and computational tasks.