November 1996 | Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth
The knowledge discovery in databases (KDD) process is essential for extracting useful knowledge from large volumes of data. As data overload becomes a growing challenge, new computational techniques and tools are needed to support KDD and data mining. KDD involves a series of steps, including data selection, data preparation, data cleaning, and interpretation of results, to ensure that useful knowledge is derived from data. Data mining is one step in the KDD process: the application of specific algorithms to extract patterns from data.
The KDD process is interactive and iterative, involving multiple steps such as learning the application domain, creating a target dataset, data cleaning and preprocessing, data reduction and projection, choosing the function of data mining, selecting data mining algorithms, data mining, interpretation, and using discovered knowledge. Each step is crucial for the successful application of KDD in practice.
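The steps listed above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the records, field names, function names, and the choice of "mean sales per region" as the mining step are all hypothetical, chosen to show how each stage hands its output to the next.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"region": "north", "year": 1995, "sales": 120.0},
    {"region": "north", "year": 1996, "sales": 140.0},
    {"region": "south", "year": 1996, "sales": None},
    {"region": "south", "year": 1996, "sales": 90.0},
    {"region": "east", "year": 1994, "sales": 300.0},
]

def select_target(records, min_year):
    """Create the target dataset: keep only records relevant to the goal."""
    return [r for r in records if r["year"] >= min_year]

def clean(records):
    """Cleaning and preprocessing: here, drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def project(records, fields):
    """Reduction and projection: keep only the fields the task needs."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records):
    """Data mining step: a simple summarization, mean sales per region."""
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r["sales"])
    return {region: mean(values) for region, values in groups.items()}

def interpret(patterns, threshold):
    """Interpretation: keep only patterns the analyst considers interesting."""
    return {k: v for k, v in patterns.items() if v >= threshold}

def run_kdd(records, min_year=1995, threshold=100.0):
    """Chain the steps; each argument is a choice the analyst may revisit."""
    target = select_target(records, min_year)
    prepared = project(clean(target), ["region", "sales"])
    return interpret(mine(prepared), threshold)

result = run_kdd(RAW)
print(result)
```

Because the process is iterative, an analyst would typically rerun `run_kdd` with different choices (for example a lower `threshold` or a different `min_year`) after inspecting the first result.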
Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge, and deciding whether they reflect useful knowledge is part of the overall interactive KDD process. A wide variety of data mining algorithms are described in the literature, drawn from statistics, pattern recognition, machine learning, and databases. Most data mining algorithms can be viewed as compositions of a few basic techniques and principles.
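As one concrete instance of "fitting a model to observed data", the sketch below fits a straight line by ordinary least squares. The technique is chosen here purely for illustration and the data are made up; the source does not single out this method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared residual of the fitted line on the given data."""
    residuals = (y - (slope * x + intercept) for x, y in zip(xs, ys))
    return sum(r ** 2 for r in residuals) / len(xs)

# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, mean_squared_error(xs, ys, slope, intercept))
```

The fitted parameters are the "inferred knowledge" in this toy case; whether a slope of this size is useful or interesting is a judgment made in the surrounding KDD process, not by the fitting routine.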
Model functions include classification, regression, clustering, summarization, dependency modeling, link analysis, and sequence analysis. Model representations include decision trees, linear models, nonlinear models, example-based methods, probabilistic graphical dependency models, and relational attribute models. Model preference criteria quantify how well a particular model and its parameters meet the goals of the KDD process.
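To make these terms concrete, the sketch below pairs one model function (classification) with one representation (a one-level decision tree, often called a decision stump) and one preference criterion (the number of misclassified training examples). The data, the function names, and the exhaustive threshold search are assumptions made for illustration only.

```python
def stump_errors(xs, labels, threshold):
    """Preference criterion: count of misclassified examples for the rule
    'predict 1 if x > threshold else 0'."""
    return sum((1 if x > threshold else 0) != y for x, y in zip(xs, labels))

def fit_stump(xs, labels):
    """Try a threshold midway between each pair of adjacent distinct values
    and keep the one with the fewest training errors."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        raise ValueError("need at least two distinct x values")
    return min(candidates, key=lambda t: stump_errors(xs, labels, t))

# Hypothetical one-dimensional training data with binary labels.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
threshold = fit_stump(xs, labels)
print(threshold, stump_errors(xs, labels, threshold))
```

Swapping the criterion (for example, error on held-out data instead of training data) or the representation (a deeper tree, a linear model) changes which model is preferred, which is why the three components are usually described separately.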
Research issues and challenges in KDD include handling massive datasets and high dimensionality; user interaction and prior knowledge; overfitting and assessing statistical significance; missing data; understandability of patterns; managing changing data and knowledge; integration with other systems; and nonstandard, multimedia, and object-oriented data. Despite these challenges, KDD has achieved some successes and is driven by strong social and economic needs. The field remains in its infancy, with many challenges to overcome, but it has the potential to provide valuable insights and tools for managing and analyzing large volumes of data.