August 24-27, 2008, Las Vegas, Nevada, USA | Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis
This paper addresses the problem of acquiring multiple, noisy labels for data items when labeling is imperfect. It examines how repeated labeling can improve label quality and model performance in supervised learning. With the rise of low-cost labeling platforms such as Mechanical Turk, obtaining non-expert labels has become cheap and fast; even so, the cost of labeling can be significant compared to the cost of acquiring and preprocessing unlabeled data.
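The core intuition behind repeated labeling — that aggregating several noisy labels (for example by majority vote) can yield a higher-quality integrated label than any single label — can be sketched with a small simulation. The labeler accuracy `p` and the helper names below are illustrative, not taken from the paper:

```python
import random
from collections import Counter

def majority_vote(labels):
    """Integrate repeated labels by taking the most common value."""
    return Counter(labels).most_common(1)[0][0]

def integrated_quality(p, k, trials=10_000, seed=0):
    """Estimate the probability that a majority vote over k labels,
    each independently correct with probability p, is correct.
    Uses an odd k so that ties cannot occur."""
    rng = random.Random(seed)
    true_label = 1
    correct = 0
    for _ in range(trials):
        labels = [true_label if rng.random() < p else 0 for _ in range(k)]
        correct += majority_vote(labels) == true_label
    return correct / trials

print(integrated_quality(0.7, 1))  # ~0.70: a single noisy label
print(integrated_quality(0.7, 5))  # ~0.84: five labels, majority vote
```

With 70%-accurate labelers, five labels per item lift integrated quality to roughly 84% (the binomial probability of at least 3 of 5 correct votes), which is the kind of quality gain the paper weighs against the extra labeling cost.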
The paper presents strategies for repeated labeling, showing that even with noisy labelers, acquiring multiple labels per item can be more effective than single labeling, especially when unlabeled data are not free to obtain and preprocess. It also introduces a robust technique that combines different measures of uncertainty to select the data points whose label quality should be improved. The results show that selectively acquiring multiple labels can significantly enhance model performance, particularly when individual label quality is low. The paper compares labeling strategies — round-robin repeated labeling and selective approaches based on label uncertainty and on model uncertainty — and demonstrates that combining label and model uncertainty leads to the best performance. The study highlights the importance of considering both label and model uncertainty in data mining tasks, and suggests that future work explore more sophisticated strategies for handling noisy labelers and varying labeling costs.
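One way to read the selective strategies: estimate, per example, how uncertain its current integrated label is, and spend the next label on the most uncertain example. A minimal sketch of label-uncertainty selection, scoring each example by the posterior mass on the minority side under a uniform Beta prior (the function names and the specific scoring choice here are illustrative assumptions, not the paper's exact formulation):

```python
from math import comb

def label_uncertainty(pos, neg):
    """Uncertainty of an integrated label given `pos` positive and `neg`
    negative votes: the posterior probability mass on the minority side
    under a uniform Beta(1, 1) prior, computed in closed form."""
    a, b = pos + 1, neg + 1
    # P(theta <= 0.5) for Beta(a, b) with integer parameters, via the
    # binomial identity: I_0.5(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) * 0.5^(a+b-1)
    tail = sum(comb(a + b - 1, j) for j in range(a, a + b)) * 0.5 ** (a + b - 1)
    return min(tail, 1 - tail)

def next_to_relabel(label_counts):
    """Pick the example whose current label multiset is most uncertain.
    `label_counts` maps example id -> (positive votes, negative votes)."""
    return max(label_counts, key=lambda ex: label_uncertainty(*label_counts[ex]))

counts = {"a": (5, 0), "b": (2, 1), "c": (3, 3)}
print(next_to_relabel(counts))  # "c": a 3-3 split is maximally uncertain
```

A unanimous 5-0 example scores near zero, while an evenly split 3-3 example scores 0.5, so the next label goes where it is most likely to change (or firm up) the integrated label — the same budget-allocation idea behind the paper's selective repeated-labeling strategies.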