Understanding Snorkel: Rapid Training Data Creation with Weak Supervision

Snorkel is a system that lets users train machine learning models without hand-labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, whose accuracies and correlations may be unknown. Snorkel denoises the outputs of these functions using a machine learning paradigm called data programming, which estimates the accuracies and correlations of the labeling functions without access to ground truth. The system includes a flexible interface for writing labeling functions and a generative model that combines their outputs into probabilistic labels.

Snorkel has been evaluated through user studies and real-world deployments. In a user study, subject matter experts built models 2.8 times faster and improved predictive performance by an average of 45.5% compared with seven hours of hand labeling. Snorkel also provides a trade-off optimizer that decides when to model the accuracies of labeling functions, improving both speed and performance. In two collaborations with government agencies and on four open-source datasets, Snorkel achieved an average 132% improvement in predictive performance over heuristic approaches and came within an average of 3.60% of the predictive performance of large hand-curated training sets.
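To make the idea of labeling functions and probabilistic labels concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library and is not Snorkel's generative model: Snorkel learns labeling-function accuracies from unlabeled data, whereas this sketch assumes the accuracies are already known and that the functions vote independently. The function names, the example sentences, and the accuracy values are all hypothetical.

```python
import math

# Label convention assumed for this sketch: +1 positive, -1 negative, 0 abstain.
POS, NEG, ABSTAIN = 1, -1, 0


def lf_causes(text):
    """Hypothetical heuristic: the word 'causes' suggests a positive relation."""
    return POS if "causes" in text.lower() else ABSTAIN


def lf_negation(text):
    """Hypothetical heuristic: an explicit 'not' suggests a negative relation."""
    return NEG if " not " in f" {text.lower()} " else ABSTAIN


def lf_prevents(text):
    """Hypothetical heuristic: 'prevents' suggests a negative relation."""
    return NEG if "prevents" in text.lower() else ABSTAIN


LFS = [lf_causes, lf_negation, lf_prevents]


def apply_lfs(texts, lfs):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def probabilistic_label(votes, accuracies):
    """Return P(y = POS | votes), assuming independent labeling functions,
    a uniform class prior, and ASSUMED (not learned) accuracies.

    Each non-abstaining vote adds log(acc / (1 - acc)) to the log-odds of
    the class it voted for; abstentions contribute nothing.
    """
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    texts = [
        "Aspirin causes stomach bleeding",
        "Aspirin does not cause headaches",
    ]
    assumed_accuracies = [0.8, 0.7, 0.6]  # hypothetical values, one per function
    for text, votes in zip(texts, apply_lfs(texts, LFS)):
        print(text, votes, round(probabilistic_label(votes, assumed_accuracies), 3))
```

The resulting probabilities can serve as soft training targets for a downstream model. In the approach the text describes, the fixed `assumed_accuracies` list would be replaced by accuracies estimated without ground truth, which is the part this sketch leaves out.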

Snorkel: Rapid Training Data Creation with Weak Supervision

2017 | Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré