[slides and audio] Mining high-speed data streams

This paper introduces Hoeffding trees, a decision tree learning method that allows for efficient learning from high-speed data streams. The method uses Hoeffding bounds to ensure that the output is asymptotically nearly identical to that of a conventional batch learner. VFDT, a decision-tree learning system based on Hoeffding trees, is described and evaluated. VFDT is designed to process data in constant time per example and requires only a small constant amount of memory. It can incorporate tens of thousands of examples per second using off-the-shelf hardware and is an anytime algorithm, meaning a usable model is available after the first few examples. VFDT is I/O bound, meaning it mines examples in less time than it takes to input them from disk. It does not store any examples in main memory, requiring only space proportional to the size of the tree and associated sufficient statistics. VFDT can learn by seeing each example only once, and therefore does not require examples from an online stream to ever be stored. The paper also discusses the properties of Hoeffding trees and their application to mining continuous streams of Web access data from the University of Washington main campus. The study shows that VFDT can effectively handle large data streams and outperforms traditional methods like C4.5 in terms of accuracy and efficiency. VFDT is able to process a large number of examples quickly and maintain high accuracy even with noisy data. The paper also discusses the use of VFDT for Web log data, showing that it can be applied to real-time data mining tasks. The paper concludes with a discussion of related work and future directions for research.This paper introduces Hoeffding trees, a decision tree learning method that allows for efficient learning from high-speed data streams. The method uses Hoeffding bounds to ensure that the output is asymptotically nearly identical to that of a conventional batch learner. VFDT, a decision-tree learning system based on Hoeffding trees, is described and evaluated. VFDT is designed to process data in constant time per example and requires only a small constant amount of memory. It can incorporate tens of thousands of examples per second using off-the-shelf hardware and is an anytime algorithm, meaning a usable model is available after the first few examples. VFDT is I/O bound, meaning it mines examples in less time than it takes to input them from disk. It does not store any examples in main memory, requiring only space proportional to the size of the tree and associated sufficient statistics. VFDT can learn by seeing each example only once, and therefore does not require examples from an online stream to ever be stored. The paper also discusses the properties of Hoeffding trees and their application to mining continuous streams of Web access data from the University of Washington main campus. The study shows that VFDT can effectively handle large data streams and outperforms traditional methods like C4.5 in terms of accuracy and efficiency. VFDT is able to process a large number of examples quickly and maintain high accuracy even with noisy data. The paper also discusses the use of VFDT for Web log data, showing that it can be applied to real-time data mining tasks. The paper concludes with a discussion of related work and future directions for research.

Mining High-Speed Data Streams

2000-08-08 | Pedro Domingos, Geoff Hulten