June 1975 | Alfred V. Aho and Margaret J. Corasick
This paper presents an efficient algorithm for locating all occurrences of any of a finite number of keywords in a text string. The algorithm constructs a finite state pattern matching machine from the keywords and processes the text string in a single pass. The construction of the machine takes time proportional to the sum of the lengths of the keywords, and the number of state transitions during processing is independent of the number of keywords. The algorithm was used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
The algorithm combines ideas from the Knuth-Morris-Pratt algorithm and finite state machines. It uses three functions: a goto function, a failure function, and an output function. The goto function determines transitions between states, the failure function handles cases where a transition fails, and the output function identifies keywords found.
The construction of the goto, failure, and output functions is described. The goto function is built by adding paths for each keyword, while the failure function is computed based on the goto function. The output function is updated during the computation of the failure function.
The algorithm's time complexity is analyzed, showing that the number of state transitions is independent of the number of keywords. Algorithms 2 and 3 are shown to run in linear time proportional to the sum of the lengths of the keywords.
The algorithm is applied to a bibliographic search program, where it significantly improves search performance by allowing simultaneous matching of multiple keywords in a single pass through the text. The algorithm is also shown to be efficient in terms of memory and processing time, and can be implemented using a deterministic finite automaton to eliminate failure transitions.
The paper concludes that the algorithm is well-suited for applications involving large numbers of keywords in text strings, and that it provides an efficient solution for pattern matching and information retrieval.This paper presents an efficient algorithm for locating all occurrences of any of a finite number of keywords in a text string. The algorithm constructs a finite state pattern matching machine from the keywords and processes the text string in a single pass. The construction of the machine takes time proportional to the sum of the lengths of the keywords, and the number of state transitions during processing is independent of the number of keywords. The algorithm was used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
The algorithm combines ideas from the Knuth-Morris-Pratt algorithm and finite state machines. It uses three functions: a goto function, a failure function, and an output function. The goto function determines transitions between states, the failure function handles cases where a transition fails, and the output function identifies keywords found.
The construction of the goto, failure, and output functions is described. The goto function is built by adding paths for each keyword, while the failure function is computed based on the goto function. The output function is updated during the computation of the failure function.
The algorithm's time complexity is analyzed, showing that the number of state transitions is independent of the number of keywords. Algorithms 2 and 3 are shown to run in linear time proportional to the sum of the lengths of the keywords.
The algorithm is applied to a bibliographic search program, where it significantly improves search performance by allowing simultaneous matching of multiple keywords in a single pass through the text. The algorithm is also shown to be efficient in terms of memory and processing time, and can be implemented using a deterministic finite automaton to eliminate failure transitions.
The paper concludes that the algorithm is well-suited for applications involving large numbers of keywords in text strings, and that it provides an efficient solution for pattern matching and information retrieval.