Identifying Mislabeled Training Data

1999 | Carla E. Brodley, Mark A. Friedl
This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning, with the goal of improving classification accuracy by enhancing the quality of the training data. The method uses multiple learning algorithms to build classifiers that act as noise filters for the training data. Single-algorithm, majority-vote, and consensus filters are evaluated on five datasets prone to labeling errors, and experiments show that filtering significantly improves classification accuracy at noise levels of up to 30%.

After reviewing related work on handling noise in machine learning, including instance selection and outlier detection, the paper introduces a general procedure for identifying mislabeled instances using cross-validation and multiple classifiers. Single-algorithm filters use the same algorithm for filtering and for classification, while ensemble filters combine the predictions of several classifiers. An analysis of filter error rates shows that majority filters are better at detecting bad data but more likely to discard good data, whereas consensus filters are more conservative: they retain more good data at the risk of also retaining bad data. Consensus filters are therefore preferable when training data is limited, and majority filters when data is abundant.

The approach is evaluated on five datasets, each with its own characteristics and labeling challenges: automated land cover mapping, credit approval, scene segmentation, road segmentation, and fire danger prediction. Filtering improves classification accuracy on all five, with the best results on the land cover and scene segmentation datasets. The paper concludes that filtering is effective at improving classification accuracy, but caution is needed to avoid discarding exceptions rather than noise.
Future research aims to improve the ability to distinguish between noise and exceptions.
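The cross-validation filtering procedure described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the three base classifiers here (a nearest-centroid classifier and two k-NN variants) and the fold count are stand-ins chosen to keep the example self-contained, and the demo data is synthetic.

```python
import numpy as np

def nearest_centroid_predict(X_tr, y_tr, X_te):
    """Predict the class whose centroid is closest to each test point."""
    classes = np.unique(y_tr)
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def knn_predict(X_tr, y_tr, X_te, k=3):
    """Predict by majority label among the k nearest training points."""
    dists = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(y_tr[row]).argmax() for row in neighbours])

def ensemble_filter(X, y, n_splits=5, scheme="majority", seed=0):
    """Flag instances whose label disagrees with cross-validated predictions.

    Each base classifier is trained on the other folds and votes on whether
    a held-out instance is misclassified. 'majority' flags instances that
    more than half the classifiers get wrong; 'consensus' flags only those
    that every classifier gets wrong.
    """
    base = [nearest_centroid_predict,
            lambda a, b, c: knn_predict(a, b, c, k=1),
            lambda a, b, c: knn_predict(a, b, c, k=3)]
    order = np.random.default_rng(seed).permutation(len(y))
    wrong = np.zeros((len(y), len(base)), dtype=bool)
    for fold in np.array_split(order, n_splits):
        train = np.ones(len(y), dtype=bool)
        train[fold] = False
        for j, predict in enumerate(base):
            wrong[fold, j] = predict(X[train], y[train], X[fold]) != y[fold]
    count = wrong.sum(axis=1)
    return count == len(base) if scheme == "consensus" else count > len(base) / 2

# Demo: two well-separated Gaussian clusters with 10% of the labels flipped.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
flipped = rng.choice(200, size=20, replace=False)
y_noisy = y.copy()
y_noisy[flipped] ^= 1
flag_majority = ensemble_filter(X, y_noisy, scheme="majority")
flag_consensus = ensemble_filter(X, y_noisy, scheme="consensus")
```

Because consensus requires every classifier to misclassify an instance while majority requires only more than half, the consensus filter can never flag more instances than the majority filter, which mirrors the paper's conclusion that consensus filtering is the more conservative choice.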