Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers


August 24-27, 2008 | Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis
This paper explores the effectiveness of repeated labeling for improving data quality and model performance in the context of supervised learning. The authors examine how repeated labeling can enhance the quality of training labels, particularly when labels are noisy and obtained from multiple, potentially less expert labelers. They present and analyze several repeated-labeling strategies, including majority voting and uncertainty-preserving labeling, and compare their performance with single labeling. Key findings include:

1. **Label Quality Improvement**: Repeated labeling can improve label quality, but its effectiveness depends on the individual labelers' quality and on the number of labels acquired.
2. **Model Accuracy**: Repeated labeling can be more beneficial than single labeling, especially when the cost of labeling is low relative to the cost of acquiring new data points.
3. **Selective Repeated Labeling**: Selectively choosing data points for repeated labeling, based on label uncertainty and model uncertainty, can further improve performance.
4. **Experimental Results**: Experiments on 12 real-world datasets show that repeated labeling, particularly with uncertainty preservation, consistently improves model accuracy.

The paper concludes by highlighting the practical implications of these findings and suggesting directions for future research, including handling labelers of varying quality and combining label and model uncertainty.
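The interplay between individual labeler quality and the number of labels can be illustrated with a simple binomial model: if each of an odd number of labelers is independently correct with probability p, the majority vote is correct with the probability computed below. This is only a sketch under the assumptions of independent labelers of equal quality; the function name is illustrative and not taken from the paper.

```python
from math import comb

def majority_vote_quality(p: float, num_labels: int) -> float:
    """Probability that a majority vote over num_labels independent
    labelers, each correct with probability p, yields the correct label.
    Requires an odd number of labels so that ties cannot occur."""
    assert num_labels % 2 == 1, "use an odd number of labels"
    k = num_labels // 2 + 1  # minimum number of correct votes for a majority
    return sum(comb(num_labels, i) * p**i * (1 - p)**(num_labels - i)
               for i in range(k, num_labels + 1))

# With p = 0.7, three labels raise quality to about 0.784;
# with p = 0.4, three labels lower it to about 0.352.
```

Under this model, repeated labeling with majority voting helps only when individual labelers are better than chance: for p > 0.5 quality rises toward 1 as more labels are added, while for p < 0.5 it falls, consistent with the paper's observation that effectiveness depends on individual labeling quality.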