This survey discusses text similarity approaches, categorizing them into String-based, Corpus-based, and Knowledge-based methods. Text similarity is crucial for tasks like information retrieval, document clustering, and machine translation. String-based measures compare texts at the level of character sequences or word composition, Corpus-based measures derive semantic similarity from statistics over large text corpora, and Knowledge-based measures rely on semantic networks, such as WordNet, to assess the relatedness of words.
String-based methods include algorithms like Longest Common Substring, Damerau-Levenshtein, Jaro, Jaro-Winkler, Needleman-Wunsch, Smith-Waterman, and N-gram. These methods focus on character-level or term-level comparisons. Corpus-based methods include HAL, LSA, GLSA, ESA, CL-ESA, PMI-IR, SCO-PMI, NGD, and DISCO, which use statistical analysis of large text corpora to determine semantic similarity. Knowledge-based methods include Resnik, Lin, Jiang & Conrath, Leacock & Chodorow, Wu & Palmer, and Path, which use semantic networks to measure word similarity.
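To make the string-based family concrete, here is a minimal sketch (not from the survey; function names are our own) of two such measures: Levenshtein edit distance, a character-based measure closely related to Damerau-Levenshtein, and a character-bigram Dice coefficient, a simple N-gram measure:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def bigram_dice(a: str, b: str) -> float:
    """Character-bigram similarity via the Dice coefficient (an N-gram measure)."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

For example, `levenshtein("kitten", "sitting")` is 3, while `bigram_dice` returns a normalized score in [0, 1], which is often more convenient when combining measures.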
Hybrid methods combine multiple similarity measures to improve accuracy. For example, combining corpus-based and knowledge-based methods has shown better performance in sentence similarity tasks. The survey also highlights useful similarity packages such as SimMetrics, WordNet::Similarity, and NLTK.
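As an illustration of one knowledge-based measure, the Wu & Palmer score can be sketched as follows. It is defined as 2·depth(LCS) / (depth(c1) + depth(c2)), where LCS is the least common subsumer of the two concepts in an is-a hierarchy. The toy taxonomy below is a hand-built stand-in for WordNet, purely illustrative:

```python
# Toy is-a taxonomy (child -> parent); a stand-in for WordNet, not real data.
PARENT = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal",
}

def path_to_root(concept: str) -> list[str]:
    """Return the chain concept -> ... -> root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(concept: str) -> int:
    """Node depth, counting the root as depth 1."""
    return len(path_to_root(concept))

def wu_palmer(c1: str, c2: str) -> float:
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    ancestors1 = set(path_to_root(c1))
    lcs = next(c for c in path_to_root(c2) if c in ancestors1)
    return 2 * depth(lcs) / (depth(c1) + depth(c2))
```

Here `wu_palmer("dog", "cat")` gives 0.6, since their least common subsumer is "carnivore". In practice one would use a package such as WordNet::Similarity or NLTK's WordNet interface rather than a hand-built hierarchy.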
The authors are Wael H. Gomaa and Aly A. Fahmy, both experts in computer science and artificial intelligence. Gomaa is a Ph.D. student in Automatic Assessment, while Fahmy is a professor specializing in Artificial Intelligence and Machine Learning. Their research focuses on text mining, natural language processing, and data mining.