Sequential PATTERN Mining using A Bitmap Representation

Sequential PATTERN Mining using A Bitmap Representation

2002 | Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick
This paper introduces a new algorithm for mining sequential patterns called SPAM (Sequential PAttern Mining). The algorithm is particularly efficient when the sequential patterns in the database are very long. It uses a depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. The algorithm combines a vertical bitmap representation of the database with efficient support counting. A key feature of SPAM is that it incrementally outputs new frequent itemsets in an online fashion. The algorithm is based on a lexicographic tree of sequences. Each node in the tree can generate sequence-extended and itemset-extended sequences. The algorithm traverses the tree in a depth-first manner, testing the support of each generated sequence. If the support is greater than or equal to the minimum support threshold, the sequence is stored and the algorithm continues recursively. If not, the algorithm prunes the search space using Apriori-based techniques to minimize the size of candidate sets at each node. The algorithm uses a vertical bitmap representation of the data, where each item has a bitmap representing its presence in each transaction. This allows for efficient counting and candidate generation. The algorithm also includes S-step and I-step pruning techniques to further reduce the search space. The algorithm was evaluated on standard benchmark datasets and outperformed previous algorithms such as SPADE and PrefixSpan by up to an order of magnitude. SPAM is particularly effective for large datasets due to its efficient counting and pruning mechanisms. However, it is less space-efficient than SPADE. The choice between SPAM and SPADE is a space-time tradeoff. The algorithm is efficient in terms of runtime and is suitable for finding frequent sequences in large datasets.This paper introduces a new algorithm for mining sequential patterns called SPAM (Sequential PAttern Mining). The algorithm is particularly efficient when the sequential patterns in the database are very long. It uses a depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. The algorithm combines a vertical bitmap representation of the database with efficient support counting. A key feature of SPAM is that it incrementally outputs new frequent itemsets in an online fashion. The algorithm is based on a lexicographic tree of sequences. Each node in the tree can generate sequence-extended and itemset-extended sequences. The algorithm traverses the tree in a depth-first manner, testing the support of each generated sequence. If the support is greater than or equal to the minimum support threshold, the sequence is stored and the algorithm continues recursively. If not, the algorithm prunes the search space using Apriori-based techniques to minimize the size of candidate sets at each node. The algorithm uses a vertical bitmap representation of the data, where each item has a bitmap representing its presence in each transaction. This allows for efficient counting and candidate generation. The algorithm also includes S-step and I-step pruning techniques to further reduce the search space. The algorithm was evaluated on standard benchmark datasets and outperformed previous algorithms such as SPADE and PrefixSpan by up to an order of magnitude. SPAM is particularly effective for large datasets due to its efficient counting and pruning mechanisms. However, it is less space-efficient than SPADE. The choice between SPAM and SPADE is a space-time tradeoff. The algorithm is efficient in terms of runtime and is suitable for finding frequent sequences in large datasets.
Reach us at info@study.space
[slides] Sequential PAttern mining using a bitmap representation | StudySpace