To appear in the special issue of Artificial Intelligence on 'Relevance' | Avrim L. Blum, Pat Langley
This paper reviews methods for handling large datasets with irrelevant information, focusing on two key issues: selecting relevant features and selecting relevant examples. The authors describe advances in empirical and theoretical work, presenting a general framework to compare different methods. They begin by discussing the problem of irrelevant features, defining relevance and reviewing various feature selection algorithms, including embedded, filter, and wrapper approaches. They then turn to the problem of irrelevant examples, describing methods for filtering both labeled and unlabeled data. The paper concludes with open challenges for future research on both the empirical and theoretical fronts.