2002 | Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, Dan Flickinger
Multiword expressions (MWEs) pose significant challenges for natural language processing (NLP). This paper discusses the problem and the analytic techniques available for it. Different kinds of MWEs call for different analyses, including listing them as "words with spaces", hierarchically organized lexicons, restricted combinatoric rules, lexical selection, "idiomatic constructions", and simple statistical affinity. An adequate analysis of MWEs must use both symbolic and statistical techniques.
The tension between symbolic and statistical methods is evident in NLP. While some believe statistical methods have made linguistic analysis unnecessary, this is not the case. Modern statistical NLP needs better language models. At the same time, deep processing has crossed the industrial threshold and is used in many applications. However, deep analysis must address two key problems: disambiguation and MWEs.
Disambiguation is a major issue, as linguistic precision is inversely related to sentence ambiguity. Knowledge representation has not provided satisfactory solutions. Many researchers are now exploring stochastic methods for ambiguity resolution.
The problem of MWEs is underappreciated. MWEs are "idiosyncratic interpretations that cross word boundaries". Jackendoff estimates that the number of MWEs in a speaker's lexicon is similar to the number of single words. In WordNet 1.7, 41% of entries are MWEs. Specialized domain vocabulary is largely composed of MWEs, and systems must handle many domains.
MWEs appear in all text genres and pose significant problems for NLP. If treated by general methods, there is an overgeneration problem. For example, a system might generate unacceptable examples like "telephone cabinet". There is also an idiomaticity problem: expressions like "kick the bucket" have meanings unrelated to their components.
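The overgeneration problem can be illustrated with a minimal sketch. The toy lexicon, the list of attested compounds, and the function names below are hypothetical and chosen only for illustration; they do not come from the paper's own systems. A rule that freely combines a modifier with any semantically plausible head noun produces compounds that speakers do not use.

```python
from itertools import product

# Hypothetical toy lexicon: head nouns that a purely compositional
# generator would treat as interchangeable.
MODIFIERS = ["telephone"]
HEADS = ["booth", "box", "cabinet", "closet"]

# Compounds assumed to be in actual use (illustrative, not exhaustive).
ATTESTED = {"telephone booth", "telephone box"}


def generate_compounds(modifiers, heads):
    """Freely combine every modifier with every head noun."""
    return [f"{m} {h}" for m, h in product(modifiers, heads)]


def overgenerated(candidates, attested):
    """Return candidates the free rule produces but the attested list lacks."""
    return [c for c in candidates if c not in attested]


candidates = generate_compounds(MODIFIERS, HEADS)
print(overgenerated(candidates, ATTESTED))
# prints ['telephone cabinet', 'telephone closet']
```

Filtering against a list of attested forms is only one possible remedy, and it trades overgeneration for the coverage and maintenance cost of the list.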
Many treat MWEs as "words with spaces", but this approach has limitations. It suffers from a flexibility problem. A parser might correctly assign multiple interpretations but fail to recognize idiomatic expressions. MWEs require more sophisticated analysis.
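The flexibility problem of the "words with spaces" approach can be sketched as follows. The lexicon entries, glosses, and the `find_fixed_mwes` helper are assumptions made for this illustration, not part of the paper. Each MWE is stored as one fixed string and matched only as a contiguous token sequence, so inflected or discontinuous variants are missed.

```python
# Hypothetical "words with spaces" lexicon: each MWE is one fixed string.
WWS_LEXICON = {
    "kick the bucket": "die",
    "look up": "consult",
}


def find_fixed_mwes(sentence, lexicon):
    """Return lexicon entries that occur as contiguous token sequences."""
    tokens = sentence.lower().split()
    found = []
    for entry in lexicon:
        parts = entry.split()
        n = len(parts)
        # Slide a window of the entry's length over the sentence tokens.
        if any(tokens[i:i + n] == parts for i in range(len(tokens) - n + 1)):
            found.append(entry)
    return found


print(find_fixed_mwes("He will kick the bucket", WWS_LEXICON))
# prints ['kick the bucket']
print(find_fixed_mwes("He kicked the bucket", WWS_LEXICON))
# prints [] because the inflected verb does not match the stored string
print(find_fixed_mwes("She will look the word up", WWS_LEXICON))
# prints [] because the verb and particle are separated
```

Listing every inflected and reordered variant as a separate entry would recover these cases, but only at the cost of a much larger lexicon and lost generalizations.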