This paper presents a method for generating indicative summaries of arbitrary texts using lexical chains as a source representation. The approach does not require full semantic interpretation of the text but instead relies on a model of topic progression derived from lexical chains. The algorithm for computing lexical chains combines several knowledge sources: the WordNet thesaurus, a part-of-speech tagger, a shallow parser for nominal group identification, and a segmentation algorithm based on Hearst's method. The summarization process involves four steps: text segmentation, construction of lexical chains, identification of strong chains, and extraction of significant sentences.
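The chain-construction step can be sketched as a greedy grouping of candidate words by semantic relatedness. The sketch below uses a tiny hand-made synonym table as a stand-in for WordNet relations (an assumption for illustration; the actual method draws on WordNet synonymy, hypernymy, and other relations):

```python
# Toy stand-in for WordNet relatedness (assumption, not the paper's resource).
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
    "vehicle": {"car", "automobile"},
}

def related(w1, w2):
    """Two words are related if identical or listed as synonyms."""
    return w1 == w2 or w2 in SYNONYMS.get(w1, set())

def build_chains(words):
    """Greedily attach each word to the first chain containing a related
    member; otherwise start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, m) for m in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains
```

On the input `["car", "automobile", "tree", "vehicle"]` this yields two chains, one grouping the vehicle-related words and one containing `tree` alone.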
The paper discusses the importance of lexical cohesion in text summarization, highlighting that it captures the "aboutness" of the text. Lexical chains, which represent sequences of semantically related words, are used to identify the main topics of a text. The authors propose a scoring function for chains based on length and homogeneity, which helps identify strong chains that are most representative of the text's main topic.
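The length-and-homogeneity scoring idea can be sketched as follows. The exact formula and the strong-chain cutoff are hedged assumptions here: length is taken as the total number of member occurrences, homogeneity as one minus the ratio of distinct members to length, and a chain counts as strong when its score exceeds the mean by two standard deviations:

```python
import statistics

def chain_score(chain):
    """Score = length * homogeneity, where length counts all member
    occurrences and homogeneity = 1 - distinct/length (assumed reading
    of the paper's length-and-homogeneity criterion)."""
    length = len(chain)
    homogeneity = 1 - len(set(chain)) / length
    return length * homogeneity

def strong_chains(chains):
    """Keep chains scoring above mean + 2 * stdev of all chain scores
    (assumed cutoff for 'strong' chains)."""
    scores = [chain_score(c) for c in chains]
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [c for c, s in zip(chains, scores) if s > mean + 2 * sd]
```

For example, a chain with three occurrences of two distinct words (`["machine", "machine", "device"]`) scores 3 × (1 − 2/3) = 1.0, while a chain of three distinct words scores 0, reflecting lower homogeneity.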
The paper also presents three heuristics for extracting significant sentences from the text based on chain distribution. The first heuristic selects the sentence containing the first appearance of a chain member, the second selects the sentence containing the first appearance of a representative word, and the third identifies text units where the chain is highly concentrated. The second heuristic is found to produce the best summaries.
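The second heuristic can be sketched as below. The definition of a "representative word" is an assumption for illustration: a chain member whose frequency within the chain is at least the average member frequency. The function then returns the first sentence containing such a word:

```python
from collections import Counter

def representative_words(chain):
    """Chain members occurring at least as often as the average member
    (assumed definition of 'representative word')."""
    counts = Counter(chain)
    avg = sum(counts.values()) / len(counts)
    return {w for w, c in counts.items() if c >= avg}

def extract_sentence(sentences, chain):
    """Heuristic 2 sketch: the first sentence containing a
    representative word of the chain."""
    reps = representative_words(chain)
    for sent in sentences:
        if reps & set(sent.lower().split()):
            return sent
    return None
```

With the chain `["engine", "engine", "motor"]`, only `engine` is representative, so a sentence mentioning `motor` alone is skipped in favor of the first sentence mentioning `engine`.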
The authors also discuss the limitations of their method, including the use of whole sentences as units and the lack of control over summary length and detail. They suggest that future work should focus on improving the method by incorporating additional text features and refining the scoring function. The paper concludes that their method produces summaries of higher quality than those generated by commercial systems.