Unsupervised Learning of the Morphology of a Natural Language

This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, with corpora ranging from 5,000 to 500,000 words. It develops a set of heuristics that rapidly build a probabilistic morphological grammar, and it uses MDL as the criterion for deciding whether each proposed modification to that grammar should be adopted. The resulting grammar agrees well with the analysis a human morphologist would produce. The study also discusses how MDL-based grammatical analysis relates to the evaluation metrics of early generative grammar.

The paper addresses the unsupervised acquisition of morphology, specifically the segmentation of words into morphemes. The program takes a text file as input and produces a partial morphological analysis of most of the words in the corpus, with the aim of matching the analysis a human would give. Learning is unsupervised in the sense that the corpus is the only input: no dictionary and no language-specific rules are supplied. The goal is to divide words correctly into morphemes, with only basic category labels attached. The underlying model follows MDL principles, under which the preferred analysis is the one that yields the most compact representation of the data. The novelty lies in using simple statements of morphological patterns, called signatures, both to quantify description length and to construct the morphological grammar. The system sets ambitious goals, recasting traditional strategies of morphological analysis in algorithmic terms.

Unsupervised learning offers both theoretical and practical benefits, including a complete account of the relationship between data and analysis and the potential for fully automated generation of morphologies. This is particularly useful for European languages, for which building a morphology by hand is time-consuming. The project also serves as a preparatory step toward systems for unsupervised grammar acquisition.

Previous research includes methods based on conditional entropy, bigram and trigram analysis, phonological pattern discovery, and top-down MDL analysis. The study compares these methods and finds that local peaks in conditional entropy can identify morpheme boundaries, though with limitations. Other approaches, such as those that use labeled word pairs or clustering, have also been explored.

The study introduces Linguistica, a C++ program that analyzes corpora and achieves high precision and recall on test data. The program has been tested on multiple languages, including English, French, German, Spanish, Italian, Dutch, Latin, and Russian. The paper covers the MDL model, the heuristics for the initial splitting of words, the resulting signatures, the use of MDL in the search for a morphology, results, spurious generalizations, the grouping of signatures, and future improvements. It closes with speculation on the broader implications of the work and on ongoing research.
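
To make the idea of a signature more concrete, the following is a minimal sketch in Python. It is not Linguistica's code and not the paper's MDL formula. It assumes that a signature can be treated as the set of suffixes shared by a group of stems, and it uses a crude character count as a stand-in for description length. The function names (`candidate_splits`, `build_signatures`, `char_saving`), the length limits, and the toy corpus are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def candidate_splits(word, min_stem=3, max_suffix=4):
    """Yield (stem, suffix) pairs for one word; "" stands for the null suffix.

    The length limits are illustrative assumptions, not values from the paper.
    """
    first_cut = max(min_stem, len(word) - max_suffix)
    for cut in range(first_cut, len(word) + 1):
        yield word[:cut], word[cut:]


def build_signatures(words, min_stem=3, max_suffix=4):
    """Group stems by the exact set of suffixes they occur with in the corpus."""
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        for stem, suffix in candidate_splits(word, min_stem, max_suffix):
            stem_to_suffixes[stem].add(suffix)

    signatures = defaultdict(set)
    for stem, suffixes in stem_to_suffixes.items():
        if len(suffixes) >= 2:  # a lone suffix captures no shared pattern
            signatures[tuple(sorted(suffixes))].add(stem)
    return dict(signatures)


def char_saving(suffixes, stems):
    """Crude proxy for compression: characters saved by storing the stems and
    suffixes once each instead of spelling out every stem+suffix word.

    Assumption: raw character counts replace the probabilistic bit costs
    that a real MDL computation would use.
    """
    flat = sum(len(stem) + len(suffix) for stem in stems for suffix in suffixes)
    factored = sum(len(stem) for stem in stems) + sum(len(s) for s in suffixes)
    return flat - factored


corpus = "jump jumps jumped jumping walk walks walked walking".split()
signatures = build_signatures(corpus)
ranked = sorted(signatures.items(), key=lambda item: char_saving(*item), reverse=True)
for suffixes, stems in ranked:
    label = ".".join(s or "NULL" for s in suffixes)
    print(label, sorted(stems), char_saving(suffixes, stems))
```

On this toy corpus the sketch proposes the plausible signature NULL.ed.ing.s for the stems jump and walk, and also two spurious ones built on the stems jum and wal. The plausible signature saves more characters because it is shared by two stems. This illustrates, in a much simplified form, why a compression-based criterion can help decide which proposed analyses to keep.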

Unsupervised Learning of the Morphology of a Natural Language

2001 | John Goldsmith