2017 | Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré
Snorkel is a system that enables users to train high-performance machine learning models without hand-labeling training data. Instead, users write labeling functions that express weak supervision sources, such as heuristics, whose outputs can be noisy and correlated. Snorkel uses data programming to estimate the accuracies and correlation structure of these sources without ground truth and to generate probabilistic training labels. In a user study, subject matter experts built models 2.8× faster and improved predictive performance by an average of 45.5% compared with seven hours of hand labeling. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source datasets, Snorkel improved predictive performance by an average of 132% over prior heuristic approaches and came within an average of 3.60% of the performance of large hand-curated training sets.
Snorkel's workflow has three main stages: writing labeling functions, modeling their accuracies and correlations, and training a discriminative model. Labeling functions are written in Python or with declarative operators, allowing users to express a variety of weak supervision sources. Snorkel automatically learns a generative model that estimates the accuracies and correlations of the labeling functions and uses it to produce probabilistic labels, which are then used to train a discriminative model. This approach enables Snorkel to improve predictive performance while reducing training time.
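The first stage can be illustrated with a minimal sketch in plain Python. It does not use the Snorkel library. The three heuristics, the example sentences, and the helper `apply_lfs` are hypothetical, and the label convention (+1 positive, -1 negative, 0 abstain) is an assumption made for this illustration. Each labeling function votes on or abstains from each data point, and the votes are collected into a label matrix that the later stages consume.

```python
from typing import Callable, List, Sequence

# Assumed label convention for this sketch: +1 positive, -1 negative, 0 abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

LabelingFunction = Callable[[str], int]


def lf_contains_causes(sentence: str) -> int:
    """Hypothetical heuristic: the word 'causes' hints at a positive example."""
    return POSITIVE if "causes" in sentence.lower() else ABSTAIN


def lf_contains_negation(sentence: str) -> int:
    """Hypothetical heuristic: a standalone 'not' hints at a negative example."""
    return NEGATIVE if " not " in f" {sentence.lower()} " else ABSTAIN


def lf_short_sentence(sentence: str) -> int:
    """Hypothetical heuristic: very short sentences rarely state a relation."""
    return NEGATIVE if len(sentence.split()) < 4 else ABSTAIN


def apply_lfs(lfs: Sequence[LabelingFunction], data: Sequence[str]) -> List[List[int]]:
    """Build the label matrix: one row per data point, one column per labeling function."""
    return [[lf(x) for lf in lfs] for x in data]


sentences = [
    "Aspirin causes stomach irritation in some patients",
    "Aspirin does not cause drowsiness",
    "No effect",
]
lfs = [lf_contains_causes, lf_contains_negation, lf_short_sentence]
label_matrix = apply_lfs(lfs, sentences)

# Coverage: the fraction of data points on which each labeling function does not abstain.
coverage = [
    sum(row[j] != ABSTAIN for row in label_matrix) / len(label_matrix)
    for j in range(len(lfs))
]
```

Each function here labels only one of the three sentences, which mirrors the precise-but-low-coverage behavior typical of hand-written heuristics.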
Snorkel allows users to combine multiple weak supervision sources to create training data. It generates probabilistic labels, which can then be used to train a wide range of machine learning models. The output of Snorkel's generative model is a re-weighted combination of the user-provided labeling functions, which tend to be precise but low-coverage. Modern discriminative models can retain this precision while learning to generalize beyond the labeling functions, increasing coverage and robustness on unseen data.
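To make "re-weighted combination" concrete, the sketch below turns one row of labeling-function votes into a probabilistic label. It rests on simplifying assumptions that are not Snorkel's actual procedure: the labeling functions are treated as conditionally independent, the class prior is uniform, and the accuracies are fixed by hand. Snorkel instead estimates accuracies and correlations from the labeling functions' agreements and disagreements without ground truth. The function names are hypothetical.

```python
import math
from typing import Sequence


def lf_weight(accuracy: float) -> float:
    """Log-odds weight of a labeling function with the given (assumed) accuracy."""
    return math.log(accuracy / (1.0 - accuracy))


def probabilistic_label(votes: Sequence[int], accuracies: Sequence[float]) -> float:
    """Return P(y = +1 | votes) for votes in {+1, -1, 0}, where 0 means abstain.

    Assumes conditionally independent labeling functions and a uniform class
    prior, so the posterior log-odds is the accuracy-weighted sum of the votes.
    """
    score = sum(lf_weight(a) * v for v, a in zip(votes, accuracies))
    return 1.0 / (1.0 + math.exp(-score))


# Hand-picked accuracies for illustration only; not learned from data.
accuracies = [0.9, 0.6, 0.6]
votes = [1, -1, -1]

p_positive = probabilistic_label(votes, accuracies)  # approximately 0.8
majority_vote = sum(votes)                           # -1: an unweighted vote says "negative"
```

In this example, one accurate labeling function outweighs two weaker ones: the posterior odds are 9 × (1/1.5) × (1/1.5) = 4, so the weighted label is positive with probability 0.8 even though a simple majority vote would be negative. A probability such as 0.8 can then serve as a soft training target for a discriminative model.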
Snorkel's system has been evaluated in real-world deployments and on open-source datasets, showing significant improvements in predictive performance compared to traditional methods. It has been used for tasks such as knowledge base construction, image analysis, bioinformatics, and fraud detection. Snorkel's ability to automatically select the optimal level of complexity for modeling correlations and accuracies has been shown to improve predictive performance while reducing computational cost. The system's user study demonstrated that Snorkel is efficient and easy to use, with participants being able to write labeling functions and match or outperform models trained on hand-labeled data.