Identifying Mislabeled Training Data

1999 | Carla E. Brodley, Mark A. Friedl
This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning, with the goal of improving classification accuracy by enhancing the quality of the training data. The method uses multiple learning algorithms to build classifiers that act as noise filters for the training data. Single-algorithm, majority-vote, and consensus filters are evaluated on five datasets prone to labeling errors, and experiments show that filtering significantly improves classification accuracy at noise levels of up to 30%.

After reviewing related work on handling noise in machine learning, including instance selection and outlier detection, the paper introduces a general procedure for identifying mislabeled instances using cross-validation and multiple classifiers. Single-algorithm filters use the same algorithm for filtering and for classification, while ensemble filters combine the predictions of several classifiers. An analysis of filter error rates shows that majority filters are better at detecting bad data but more likely to discard good data, whereas consensus filters are more conservative: they retain more good data at the risk of also retaining bad data. Consensus filters are therefore preferable when training data is limited, and majority filters when data is abundant.

The approach is evaluated on five datasets, each with its own characteristics and labeling challenges: automated land cover mapping, credit approval, scene segmentation, road segmentation, and fire danger prediction. Filtering improves classification accuracy on all five, with the best results on the land cover and scene segmentation datasets. The paper concludes that filtering is effective at improving classification accuracy, but caution is needed to avoid discarding exceptions rather than noise.
Future research aims to improve the ability to distinguish between noise and exceptions.
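The cross-validation filtering procedure described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the three base classifiers here (a nearest-centroid classifier and two k-NN variants) and the fold count are stand-ins chosen to keep the example self-contained, and the demo data is synthetic.

```python
import numpy as np

def nearest_centroid_predict(X_tr, y_tr, X_te):
    """Predict the class whose centroid is closest to each test point."""
    classes = np.unique(y_tr)
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def knn_predict(X_tr, y_tr, X_te, k=3):
    """Predict by majority label among the k nearest training points."""
    dists = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(y_tr[row]).argmax() for row in neighbours])

def ensemble_filter(X, y, n_splits=5, scheme="majority", seed=0):
    """Flag instances whose label disagrees with cross-validated predictions.

    Each base classifier is trained on the other folds and votes on whether
    a held-out instance is misclassified. 'majority' flags instances that
    more than half the classifiers get wrong; 'consensus' flags only those
    that every classifier gets wrong.
    """
    base = [nearest_centroid_predict,
            lambda a, b, c: knn_predict(a, b, c, k=1),
            lambda a, b, c: knn_predict(a, b, c, k=3)]
    order = np.random.default_rng(seed).permutation(len(y))
    wrong = np.zeros((len(y), len(base)), dtype=bool)
    for fold in np.array_split(order, n_splits):
        train = np.ones(len(y), dtype=bool)
        train[fold] = False
        for j, predict in enumerate(base):
            wrong[fold, j] = predict(X[train], y[train], X[fold]) != y[fold]
    count = wrong.sum(axis=1)
    return count == len(base) if scheme == "consensus" else count > len(base) / 2

# Demo: two well-separated Gaussian clusters with 10% of the labels flipped.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
flipped = rng.choice(200, size=20, replace=False)
y_noisy = y.copy()
y_noisy[flipped] ^= 1
flag_majority = ensemble_filter(X, y_noisy, scheme="majority")
flag_consensus = ensemble_filter(X, y_noisy, scheme="consensus")
```

Because consensus requires every classifier to misclassify an instance while majority requires only more than half, the consensus filter can never flag more instances than the majority filter, which mirrors the paper's conclusion that consensus filtering is the more conservative choice.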