A trainable document summarizer

This paper presents a trainable document summarizer built on a statistical framework. The goal is to generate concise summaries that are more informative than titles yet short enough to be absorbed at a glance. The approach selects a subset of a document's sentences that is indicative of its content, by scoring sentences and presenting those with the best scores. The summarizer is trained on a corpus of document/summary pairs, where each summary is a hand-selected extract from the original document. It treats extraction as statistical classification: for each sentence, it estimates the probability that the sentence is included in a summary, given the sentence's features. This is done by estimating the probability of each feature given that a sentence is in a summary, and then applying Bayes' rule, with the features assumed independent, to obtain the probability that the sentence is part of a summary. The features used are sentence length, fixed phrases, paragraph structure, thematic words, and uppercase words.

The training corpus consists of 188 document/summary pairs from 21 scientific and technical journals. The summaries are mainly indicative and average three sentences in length. The documents were originally photocopies, so their text had to be extracted by scanning and OCR. This process introduced spelling errors and occasional omissions of text, which were manually checked and corrected.

The summarizer was evaluated by cross-validation: documents from a given journal were tested one at a time, with all other document/summary pairs used for training. The summarizer correctly identified 42% of the matchable sentences. Performance was also evaluated as the fraction of manual summary sentences faithfully reproduced by the summarizer, a measure for which 83% is the highest score attainable by sentence extraction. The paper notes that document extracts can be as informative as the full text of a document, and the best performance was achieved with a combination of features. Compared with a baseline that simply selects sentences from the beginning of a document, the summarizer gave a 74% improvement in performance. The summaries are also presented as useful for rapid relevance assessment while browsing.
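
The Bayes-rule scoring described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes every feature is reduced to a binary value, uses add-one smoothing, and ranks sentences by P(s in S) times the product over features of P(F_j | s in S) / P(F_j). The names `train`, `score`, and `select_extract`, the three toy features, and the training data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# One 0/1 value per feature. Hypothetical toy features:
# (long_enough, has_fixed_phrase, paragraph_initial)
FeatureVector = Tuple[int, ...]


@dataclass
class Model:
    n_total: int                 # number of training sentences
    n_summary: int               # number of those that are in a summary
    in_summary: List[Counter]    # per-feature value counts among summary sentences
    overall: List[Counter]       # per-feature value counts among all sentences


def train(vectors: Sequence[FeatureVector], labels: Sequence[bool]) -> Model:
    """Collect the counts needed to estimate P(F_j | s in S) and P(F_j)."""
    k = len(vectors[0])
    in_summary = [Counter() for _ in range(k)]
    overall = [Counter() for _ in range(k)]
    for vec, is_summary in zip(vectors, labels):
        for j, value in enumerate(vec):
            overall[j][value] += 1
            if is_summary:
                in_summary[j][value] += 1
    return Model(len(vectors), sum(1 for y in labels if y), in_summary, overall)


def score(model: Model, vec: FeatureVector, alpha: float = 1.0) -> float:
    """Bayes-rule score under the feature-independence assumption.

    Add-alpha smoothing over the two possible feature values is an
    assumption of this sketch, not something stated in the summary above.
    """
    result = model.n_summary / model.n_total  # prior P(s in S)
    for j, value in enumerate(vec):
        p_given_summary = (model.in_summary[j][value] + alpha) / (model.n_summary + 2 * alpha)
        p_overall = (model.overall[j][value] + alpha) / (model.n_total + 2 * alpha)
        result *= p_given_summary / p_overall
    return result


def select_extract(model: Model, doc: Sequence[FeatureVector], size: int) -> List[int]:
    """Return the indices of the `size` best-scoring sentences, in document order."""
    ranked = sorted(range(len(doc)), key=lambda i: score(model, doc[i]), reverse=True)
    return sorted(ranked[:size])


# Toy training set: feature vectors with a flag saying whether the
# sentence appeared in the hand-made summary.
train_vectors = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (1, 0, 0), (0, 0, 1)]
train_labels = [True, True, False, False, False, False]
model = train(train_vectors, train_labels)

document = [(0, 0, 0), (1, 1, 1), (1, 0, 1), (1, 1, 0)]
print(select_extract(model, document, size=2))
```

Returning the selected indices in document order keeps the extract readable as running text. Because of the independence assumption the score is best read as a ranking value rather than a calibrated probability.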

A Trainable Document Summarizer

1995 | Julian Kupiec, Jan Pedersen and Francine Chen