TextTiling is a technique for dividing texts into multi-paragraph units that represent subtopics. It uses patterns of lexical co-occurrence and distribution to identify subtopic shifts. The algorithm is fully implemented and produces segmentation that aligns well with human judgments of subtopic boundaries in 12 texts. Multi-paragraph subtopic segmentation is useful for tasks like information retrieval and summarization.
The article describes a paragraph-level model of discourse structure based on subtopic shifts and an algorithm for subdividing expository texts into multi-paragraph "passages" or subtopic segments. Expository texts are characterized as sequences of subtopic discussions within main topics. The algorithm uses lexical co-occurrence and distribution patterns to determine subtopic boundaries. Three scoring methods are explored: blocks, vocabulary introductions, and chains, though only the first two are evaluated in this article.
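The block scoring method mentioned above can be illustrated with a minimal sketch. It assumes the text is cut into fixed-length token sequences and that adjacent blocks of sequences are compared by cosine similarity over term counts; the function names, the bare regex tokenizer, and the `seq_len` and `block_size` values are illustrative assumptions, not the article's own settings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (no stemming or stopword removal)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def block_scores(tokens, seq_len=5, block_size=2):
    """Score every gap between token sequences by comparing the blocks on
    either side of it. seq_len and block_size are hypothetical defaults."""
    seqs = [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]
    scores = []
    for gap in range(1, len(seqs)):
        # Up to block_size sequences before the gap, and up to block_size after.
        left = Counter(t for s in seqs[max(0, gap - block_size):gap] for t in s)
        right = Counter(t for s in seqs[gap:gap + block_size] for t in s)
        scores.append(cosine(left, right))
    return scores


# Toy input: the vocabulary changes completely halfway through.
toy_tokens = ["cat"] * 10 + ["dog"] * 10
toy_scores = block_scores(toy_tokens)
```

In this toy input the middle gap receives a score of 0.0 because the two blocks share no terms, while the outer gaps score higher. A low score between neighboring blocks is the kind of signal that would suggest a subtopic shift.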
The article discusses the need for algorithms that can detect multi-paragraph subtopic structure, with applications in hypertext display, information retrieval, and text summarization. It also explores the relationship between subtopic structure and hierarchical discourse models, and how lexical co-occurrence patterns can be used to detect subtopic changes. The TextTiling algorithm is described in detail, along with its performance in information retrieval tasks.
The article also discusses related approaches, such as vector space similarity comparisons and lexical chain analysis, and their limitations. TextTiling detects subtopic boundaries by identifying where lexical patterns change significantly, using lexical co-occurrence and distribution as the signal. It is evaluated on its effectiveness in information retrieval tasks and compared with other methods on its ability to detect subtopic boundaries in expository texts.
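One way to turn "where lexical patterns change significantly" into a decision rule is sketched below. It assumes a list of similarity scores, one per gap between adjacent stretches of text, and marks a boundary at each valley whose depth (the climb to the nearest peak on each side) exceeds a cutoff. The cutoff used here, mean depth minus half a standard deviation, and all function names are assumptions made for illustration rather than values taken from this summary.

```python
from statistics import mean, pstdev


def depth_scores(scores):
    """For each gap, sum how far the score climbs to the nearest peak on
    the left and on the right."""
    depths = []
    for i, s in enumerate(scores):
        left, j = s, i
        while j > 0 and scores[j - 1] >= left:
            j -= 1
            left = scores[j]
        right, j = s, i
        while j < len(scores) - 1 and scores[j + 1] >= right:
            j += 1
            right = scores[j]
        depths.append((left - s) + (right - s))
    return depths


def pick_boundaries(scores, k=0.5):
    """Return indices of gaps that are local minima and deeper than the
    cutoff. k is a hypothetical tuning parameter."""
    depths = depth_scores(scores)
    if not depths:
        return []
    cutoff = mean(depths) - k * pstdev(depths)
    boundaries = []
    for i, s in enumerate(scores):
        lower_than_left = i == 0 or scores[i - 1] > s
        lower_than_right = i == len(scores) - 1 or scores[i + 1] > s
        if lower_than_left and lower_than_right and depths[i] > cutoff:
            boundaries.append(i)
    return boundaries


gap_scores = [0.8, 0.7, 0.2, 0.6, 0.9, 0.85, 0.3, 0.75]
print(pick_boundaries(gap_scores))  # → [2, 6]
```

Using relative depth rather than an absolute similarity threshold means a valley is judged against its own neighborhood, so texts with uniformly high or low lexical overlap can still yield boundaries.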