This paper introduces Xtract, a tool for retrieving and identifying collocations from large textual corpora. Collocations are recurrent word combinations that occur more frequently than expected by chance and are common in various types of writing, including technical and nontechnical genres. The paper discusses the challenges of handling collocations, such as their arbitrariness, domain dependence, recurrence, and cohesive nature. Xtract is designed to address these challenges by using statistical methods to identify and filter collocations. The tool operates in three stages:
1. **Stage One**: Extracting significant bigrams using statistical measures to identify pairs of words that co-occur frequently and rigidly (a minimal sketch of this step follows the list).
2. **Stage Two**: Converting bigrams into n-grams by analyzing the distribution of words and parts of speech around the bigram to form rigid noun phrases or phrasal templates.
3. **Stage Three**: Adding syntactic information to the collocations identified in Stage One to enhance their functional value for computational tasks.
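To make Stage One more concrete, below is a minimal sketch in Python of a bigram filter in the same spirit: it counts word pairs that co-occur within a ±5-word window and keeps those whose frequency is unusually high (a simple z-score over all pairs) and whose relative position is rigid (one dominant offset). The window size, frequency floor, and threshold values here are illustrative assumptions, not the statistics used by Xtract itself.

```python
from collections import Counter, defaultdict
import statistics

WINDOW = 5          # +/- 5-word co-occurrence window (assumed, not the paper's exact setting)
MIN_FREQ = 10       # placeholder frequency floor
Z_THRESHOLD = 1.0   # placeholder z-score threshold

def significant_bigrams(tokens):
    """Return word pairs that co-occur often and in rigid relative positions.

    Output maps (w1, w2) -> (co-occurrence count, dominant relative offset).
    """
    # Count co-occurrences of (focus word, neighbour) at each relative offset.
    pair_offsets = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                pair_offsets[(w, tokens[j])][j - i] += 1

    totals = {pair: sum(offs.values()) for pair, offs in pair_offsets.items()}
    if not totals:
        return {}
    mean = statistics.mean(totals.values())
    stdev = statistics.pstdev(totals.values()) or 1.0

    results = {}
    for pair, offs in pair_offsets.items():
        count = totals[pair]
        z = (count - mean) / stdev
        # "Rigidity": most co-occurrences fall at a single relative offset.
        offset, peak = offs.most_common(1)[0]
        if count >= MIN_FREQ and z >= Z_THRESHOLD and peak / count > 0.5:
            results[pair] = (count, offset)
    return results
```

On a tokenized corpus, such a filter would be expected to surface recurrent pairs that appear at a fixed distance from each other; the thresholds would need tuning to the corpus, and the paper's own statistical conditions for Stage One are more refined than this sketch.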
The paper presents the methodology, implementation, and evaluation of Xtract, with the evaluation carried out on a 10 million-word corpus of stock market news reports. Xtract's precision is estimated at 80% and its recall at 94%. The techniques described in the paper are shown to produce richer and more precise output than previous methods, making Xtract a valuable tool for lexicography and for computational linguistics applications.